Shadow Transfer: Single Image Relighting For Urban Road Scenes

09/23/2019 ∙ by Alexandra Carlson, et al. ∙ University of Michigan

Illumination effects in images, specifically cast shadows and shading, have been shown to decrease the performance of deep neural networks on a large number of vision-based detection, recognition and segmentation tasks in urban driving scenes. A key factor that contributes to this performance gap is the lack of ‘time-of-day’ diversity within real, labeled datasets. There have been impressive advances in image-to-image translation for transferring previously unseen visual effects into a dataset, specifically in day-to-night translation. However, it is not easy to constrain which visual effects, let alone illumination effects, are transferred from one dataset to another during the training process. To address this problem, we propose a deep learning framework, called Shadow Transfer, that can relight complex outdoor scenes by transferring realistic shadow, shading, and other lighting effects onto a single image. The novelty of the proposed framework is that it is both self-supervised and designed to operate on sensor and label information that is easily available in autonomous vehicle datasets. We show the effectiveness of this method on both synthetic and real datasets, and we provide experiments that demonstrate that the proposed method produces images of higher visual quality than state-of-the-art image-to-image translation methods.

I Introduction

In outdoor road environments, the appearance of a salient object, such as a car or pedestrian, is highly dependent upon the illumination of the scene. Adjusting the time of day can induce significant changes to the scene’s appearance despite the fact that its underlying structure and material properties remain the same. Illumination conditions at different times of day can alter scene and object appearance in two primary ways. First, the location and angle of incoming sunlight interact with the physical structure of the scene to induce different distributions of cast shadows and shading. Second, the properties of sunlight can interact with the imaging sensor to alter perceived colors within the scene; for example, different times of day can have different color temperatures and/or overall brightness. These “time of day” visual features have been shown to decrease the performance of deep neural networks (DNNs) on a variety of vision-based detection, recognition, and segmentation tasks in outdoor driving scenarios [1, 2, 3]. In Figure 1, the state-of-the-art DeepLab [4] semantic segmentation framework, trained on Cityscapes [5], is tested on images of the same scene at different times of day taken from the Oxford RobotCar dataset [6]. The segmentation performance fluctuates drastically between images captured within hours of each other, particularly in the image regions where the distribution of shadows and brightness changes significantly.

The fact that sun position/time of day is a source of prediction error for DNNs suggests that the network’s learned feature representations depend upon the illumination of the captured scene, which in turn indicates that real datasets do not have enough diversity in lighting conditions to allow the network to become invariant to such changes. Unfortunately, it is too costly to collect and label real datasets that capture a uniform distribution of lighting effects over the course of a day. While it is possible to generate datasets that capture a uniform representation of illumination effects like shadows using rendering pipelines and gaming engines [7, 8], synthetic datasets have a significant and undesirable domain gap from real data [8]. In contrast, we posit that, in order to improve the robustness of these neural network methods to varying illumination conditions, we need to model illumination changes in real, noisy images more accurately and reliably.

While there have been impressive advances in image relighting within the graphics and vision communities [9, 10], these methods rely on knowledge of the material properties of objects in the scene, or on multiple views of the scene under the same illumination condition, neither of which is available for driving datasets. The method proposed in this paper, which we dub “Shadow Transfer”, leverages the success of image-to-image translation models to learn an illumination model via a deep neural encoder-decoder framework that operates upon input that is easily obtained from a car-mounted RGB camera. Furthermore, it is designed to be self-supervised, removing the need for labeling illumination features in images, such as shadows, brightness or global color temperature. To our knowledge, this is the first attempt at specifically relighting driving datasets by transferring lighting features. We anticipate that the proposed model, by adding realistic illumination artifacts into images, is a first step towards understanding and eliminating the failure modes of detection and segmentation algorithms that result from changing lighting conditions.

II Related work

In this work, the objective is to transfer realistic lighting effects onto an image given a coarse model of scene geometry and a single RGB image. This process requires both removing and re-synthesizing illumination features within the image. The proposed Shadow Transfer method is based upon deep neural network architectures used for image-to-image translation, image relighting, shadow modeling, outdoor illumination estimation, and inverse graphics rendering. We briefly review the literature in these areas below.

II-A Illumination Estimation in images

In order to change the lighting of an image, it is often necessary to first estimate the current illumination conditions. This problem has been broken down into smaller areas, each focusing on the estimation of a specific illumination cue from an image or sequence of images. We briefly review the relevant methods in the following subsections.

II-A1 Inverse graphics and Intrinsic Images for Reflectance and Shading Estimation

Intrinsic image decomposition and inverse graphics rendering are methods that decompose an image into a set of intermediate representations or features that correspond to different physical processes in the real world, e.g., normal maps, albedo, shading, and reflectance maps. These intermediate representations can be sampled to synthesize new, unseen images, which allows for image relighting. There are drawbacks to these techniques that prevent them from being applied to large scale outdoor scenes. First, these methods make simplifying assumptions about the scene structure to make the reconstruction tractable, and thus are usually applied to scenes that contain similar spatial information, such as faces [11, 12, 13]. Second, they can require rigid scene priors [14, 15], such as Lambertian reflection and/or shape models, or complicated training regimes built on complex synthetic datasets [16, 17], none of which generalizes to real world, complex outdoor scenes.

II-A2 Shadow-specific modeling in Images

The impact of shadows on scene understanding has motivated significant research into both deep learning and hand-crafted techniques for shadow detection and removal [18, 19, 20, 21]. However, the majority of these methods are designed to produce accurate results on high quality images of simple scenes, specifically ones that contain only shadows cast upon a textured ground plane under good lighting conditions. They also typically require material and/or shadow labels, which are prohibitively expensive to generate for real datasets. The opposite problem, shadow synthesis, has also been explored, primarily in the context of adding synthetic objects into images [22, 23, 24]. These methods likewise require hard-to-obtain shadow or material labels, or rigid, hand-crafted illumination priors that are difficult to generalize to noisy outdoor autonomous vehicle datasets. These design choices prevent such methods from generalizing to the highly complex scenes captured in many autonomous vehicle datasets. In contrast, while the proposed method uses a multi-task structure inspired by [19], it is able to model both shadows and shading on scene structures without the need for shadow or material labels.

II-A3 Sun position Estimation and Modeling

Recent work applying deep learning techniques to sun location estimation has achieved significantly higher performance than handcrafted features [25]. In fact, convolutional neural networks trained for sun location prediction have been shown to learn feature representations that detect lighting cues similar to those that humans use to estimate the sun direction, e.g., shadow regions and bright regions [25, 26]. Furthermore, using sun location prediction as an additional loss has been shown to improve performance for visual tracking and odometry [25, 26, 27]. The proposed method uses the same intuition by constructing a sun estimation loss that helps transfer illumination-specific information into the input RGB image.

II-B Single and Multi-Image Relighting for real data

The goal of image relighting is to change the existing illumination condition captured in an image or set of images to a target illumination condition. At their core, image relighting techniques attempt to model the light transport function, which is composed of a model of scene lighting and the BRDF (bidirectional reflectance distribution function) that describes the material properties throughout the scene [28, 10]. However, to model this function, the majority of deep relighting methods require multiple images (anywhere between 5 and 1000, depending on the method) of the same scene under different lighting conditions to generate an accurate estimate of the lighting function [10, 9], and/or material labels of objects in the scene [23], neither of which is easy to obtain for outdoor driving datasets. In contrast, the proposed relighting method is self-supervised and designed to operate on inputs that are readily available for such datasets.

II-C Image to Image Translation for Illumination Adaptation

Many works have cast the image relighting problem as a domain adaptation problem, and have successfully used image-to-image translation methods to transfer images from the ‘daytime’ domain to the ‘nighttime’ domain [29, 30]. This suggests that these types of models may have an element of their design that naturally captures information relevant to scene lighting. However, these methods become unstable and intractable when extended to the many ‘illumination condition’ domains that would be needed to capture transitions between lighting conditions. Several works extend state-of-the-art domain-to-domain transfer methods to multi-domain to multi-domain transfer [31, 32], but, as far as we are aware, these have not been applied to the particular problem of relighting. Furthermore, the outputs of image-to-image translation methods do not match the visual quality of real world data, because the translated images contain rendering artifacts that are neither realistic nor physically based. Due to the highly unconstrained nature of these methods, it is not clear exactly which physical processes the model may be learning, and as a result there is no way to transfer specific subsets of the learned visual effects into images. In contrast, the proposed method avoids casting each light source location as a separate domain, and instead constrains the learned feature spaces of the Shadow Transfer framework through illumination-based losses and encoding networks.

Fig. 2: The Shadow Transfer Network architecture. We break the domain transfer task into two steps: first, the luminance channel from the CIELab color space is predicted from the depth, semantic segmentation map, and light source location. Second, the chrominance channels, ab, are predicted from the predicted luminance channel and light source location. The luminance and chrominance predictions are concatenated together and transformed into RGB space to generate the final model output.

III Methods

For autonomous vehicle driving datasets, we typically have access to RGB camera, depth sensor and GPS output. Pretrained, state-of-the-art segmentation networks can also be used to generate semantic maps from the RGB video input, which capture information regarding objectness and material groupings within the scene. The proposed Shadow Transfer network is designed to operate on these intermediate representations to perform image relighting. The complete architecture for the proposed Shadow Transfer network is given in Figure 2. We define the scene lighting conditions as the light source location in the scene, which for outdoor images is given by the sun azimuth and zenith angles. We choose this parameterization because altering the light source location in an image simultaneously changes the illumination effects that have been shown to be problematic for driving tasks, specifically shadow distributions, color temperature and brightness.
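As a small illustration of this parameterization, the sketch below turns an azimuth/zenith pair into a unit direction vector. The axis convention (x = east, y = north, z = up, azimuth measured from north) and the function name `sun_direction` are assumptions made for the example; the paper does not state which convention it uses.

```python
import math


def sun_direction(azimuth_deg, zenith_deg):
    """Unit vector pointing toward the sun.

    Assumed convention (not specified in the paper): x = east, y = north,
    z = up; azimuth is measured from north, zenith from the vertical.
    """
    az = math.radians(azimuth_deg)
    zen = math.radians(zenith_deg)
    horizontal = math.sin(zen)           # length of the ground-plane projection
    return (horizontal * math.sin(az),   # east component
            horizontal * math.cos(az),   # north component
            math.cos(zen))               # up component


# Two of the light source locations listed in Table I.
for az, zen in [(-60.0, 60.0), (0.0, 75.0)]:
    print((az, zen), tuple(round(c, 3) for c in sun_direction(az, zen)))
```

A two-angle representation like this keeps the conditioning input low-dimensional, which is one plausible reason for feeding it through a dedicated encoder rather than treating every sun position as its own image domain.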

III-A Shadow Transfer Architecture

We adopt a framework that consists of two stacked encoder-decoder neural networks to perform two tasks: first luminance prediction, and then chrominance prediction. We choose to use the channels of the CIELab color space representation for both tasks due to its success in shadow edge detection tasks [24, 33].

Luminance Prediction Network

For luminance prediction, we use a U-net [34] inspired encoder-decoder neural network to predict the luminance channel of the ground truth RGB image converted into CIELab space. As shown in the top of Figure 2, this network is composed of three sub-networks: a geometry encoder that takes in the concatenated semantic segmentation and depth maps, a light source location encoder that takes in the location of the light source in the scene and projects it into the geometry encoder’s latent space, and a luminance decoder that operates on the concatenated light source latent vector and geometry latent vector.
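The luminance and chrominance targets come from converting the ground truth RGB image into CIELab space. The following is a minimal per-pixel sketch of that conversion, assuming sRGB input in [0, 1] and a D65 reference white; the paper does not say which conversion routine or white point was used, and `srgb_to_lab` is a name chosen here for illustration.

```python
def srgb_to_lab(r, g, b):
    """Convert one sRGB pixel (components in [0, 1]) to CIELab (D65 white)."""

    def linearize(c):
        # Undo the sRGB transfer curve to obtain linear light.
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)

    # Linear sRGB -> CIE XYZ (D65).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t):
        # Cube root with the standard linear segment near zero.
        delta = 6.0 / 29.0
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3.0 * delta ** 2) + 4.0 / 29.0

    xn, yn, zn = 0.95047, 1.0, 1.08883   # D65 reference white
    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)

    lightness = 116.0 * fy - 16.0        # L channel: target of the first network
    a = 500.0 * (fx - fy)                # a channel: green-red axis
    b_chan = 200.0 * (fy - fz)           # b channel: blue-yellow axis
    return lightness, a, b_chan


print(srgb_to_lab(1.0, 1.0, 1.0))  # white: L close to 100, a and b close to 0
```

Because the targets are a deterministic function of the RGB image, no manual annotation is needed to supervise either prediction stage.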


Chrominance Prediction Network

For chrominance prediction, we use a similar framework: a U-net inspired encoder-decoder that is composed of a luminance encoder that takes in the predicted luminance channel, a light source location encoder that takes in the location of the light source in the scene and projects it into the luminance encoder’s latent space, and a chrominance decoder that operates on the concatenated light source latent vector and luminance latent vector to generate the chrominance channel predictions. Note that this is a similar process to the unsupervised colorization method proposed in [35].

Illumination-constrained latent space learning

In classic image-to-image translation networks, an encoder neural network compresses and distills visual information into a latent space representation, which a decoder network then projects into an image. Ideally, each dimension (or subset of dimensions) of the latent space representation corresponds to a particular visual feature within an image, and thus altering the values of that latent dimension alters the visual feature in the output image. However, in practice the learned latent spaces of image-to-image translation networks trained on complex scenes are not interpretable, nor necessarily smooth and easy to sample [12]. To transfer only illumination effects, we propose to externally force a subset of the latent space to correspond to light source location, similar to [12]. This is achieved by injecting light source information into the luminance and chrominance latent spaces using the two separate light source encoders described above. This design choice allows the two networks to learn which light source locations correspond to which shadow/shading distributions (based upon the extracted geometric information), as well as the coloration that they induce in real images.

Stacking Encoders for Multitask learning

The motivation behind separating luminance and chrominance prediction into a feed-forward multi-task framework stems from the success of this design in joint shadow detection and removal tasks [19, 9]. Modularizing and stacking the two tasks, so that the output of one task is used as input to the other, allows each network to focus on a single task at a time, reducing the complexity of the information that needs to be learned. Since the prediction occurs in two different stages, each network can share mutual improvements through forward/backward information flows. This multitask framework lends itself to luminance and chrominance prediction in particular, primarily due to the co-incident nature of shadow and reflectance edges between the L and a channels, as well as the relationship between the ‘yellowness’ of sunlight and the respective ‘blueness’ of shadows within the b channel [33]. This means that inaccurate predictions of the L channel would elicit inaccurate predictions of other illumination effects in the a and b channels, and this chrominance error would be backpropagated into the luminance encoder-decoder network, providing a better training signal than luminance error alone.
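The data flow through the two stacked stages can be sketched as follows. This is only an illustration of how the pieces connect: every "network" below is a toy stand-in function invented for the example (the real sub-networks are learned U-net style models), and all names such as `relight` and `make_toy_nets` are hypothetical.

```python
def relight(depth, seg, sun, nets):
    """Run the two stacked stages and return per-pixel [L, a, b] values."""
    # Stage 1: geometry latent + light source latent -> luminance.
    z_geo = nets["geometry_encoder"](depth, seg)
    z_sun = nets["light_encoder_lum"](sun)
    lum = nets["luminance_decoder"](z_geo + z_sun)     # list concatenation

    # Stage 2: latent of the *predicted* luminance + light latent -> chrominance.
    z_lum = nets["luminance_encoder"](lum)
    z_sun = nets["light_encoder_chroma"](sun)
    chroma = nets["chrominance_decoder"](z_lum + z_sun)

    # Concatenate the channels; a CIELab -> RGB conversion would follow.
    return [[l, a, b] for l, (a, b) in zip(lum, chroma)]


def mean(values):
    return sum(values) / len(values)


def make_toy_nets(n_pixels):
    """Toy stand-ins so the sketch runs; none of these are learned models."""
    light = lambda sun: [sun[0] / 180.0, sun[1] / 90.0]
    return {
        "geometry_encoder": lambda depth, seg: [mean(depth), mean(seg)],
        "light_encoder_lum": light,
        "luminance_decoder": lambda z: [mean(z)] * n_pixels,
        "luminance_encoder": lambda lum: [mean(lum)],
        "light_encoder_chroma": light,
        "chrominance_decoder": lambda z: [(mean(z), -mean(z))] * n_pixels,
    }


depth = [5.0, 10.0, 20.0, 40.0]   # flattened 2x2 depth map
seg = [0, 1, 1, 2]                # flattened semantic class ids
nets = make_toy_nets(len(depth))
print(relight(depth, seg, (-60.0, 60.0), nets)[0])
print(relight(depth, seg, (60.0, 60.0), nets)[0])
```

The point of the sketch is the wiring: the sun location enters both stages through its own encoder, and the second stage only ever sees the luminance predicted by the first, which is what lets chrominance error flow back into the luminance network during training.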

III-B General Training procedure

The Shadow Transfer framework is designed to use a self-supervised training paradigm. First, to ensure local semantic/structural consistency in each predicted image, we apply the standard L1 loss to the predicted L channel and the predicted ab channels. This is a self-supervised loss because the “labels”, which are the ground truth CIELab channels, are obtained merely by applying the colorspace conversion to the RGB image. Since the family of L-norm losses is notorious for poorly reconstructing high frequency image information in local pixel neighborhoods, we also use the standard perceptual and style losses proposed in [36] on the predicted RGB images, using the feature spaces of the VGG16 [37] architecture pretrained on ImageNet [38].

However, none of these losses directly targets the illumination effects that we want altered in the image. We therefore build on the results of [25], which demonstrated that the neurons/layers of a CNN trained for sun location estimation from a single RGB image learn to activate in response to shadow and brightness regions in the image. We pretrain a VGG16 network (initialized with weights pretrained for scene classification on the Places-365 dataset) to perform sun location estimation on our ground truth RGB images. We use an L2 loss to train this network using the ground truth light source locations. We refer to this feature loss network as SunEst-CNN. We then fix the weights of this sun estimation network and, during the training of the Shadow Transfer neural network, use its ‘illumination’ feature spaces to calculate an illumination-focused perceptual loss on the predicted RGB images. This is also a self-supervised loss, because it only requires the calculation of the light source location from the GPS and timestamps of the training dataset. The total loss used to train the Shadow Transfer network is the sum of these four losses.
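One way to write the objective described above is given below. The grouping into four terms (L1, perceptual, style, SunEst-CNN feature loss), the absence of weighting coefficients, the squared-error feature distance, and all symbols are assumptions made for illustration; the text states only that the total is the sum of four losses. Here $\hat{L}$, $\widehat{ab}$ and $\hat{I}$ denote the predicted luminance, chrominance and RGB image, $L$, $ab$ and $I$ their ground truth counterparts, and $\phi^{\text{sun}}_j$ the activations of layer $j$ of the frozen SunEst-CNN.

```latex
\begin{aligned}
\mathcal{L}_{\text{total}} &= \mathcal{L}_{1} + \mathcal{L}_{\text{perc}}
    + \mathcal{L}_{\text{style}} + \mathcal{L}_{\text{sun}}, \\
\mathcal{L}_{1} &= \lVert \hat{L} - L \rVert_{1}
    + \lVert \widehat{ab} - ab \rVert_{1}, \\
\mathcal{L}_{\text{sun}} &= \sum_{j}
    \lVert \phi^{\text{sun}}_{j}(\hat{I}) - \phi^{\text{sun}}_{j}(I) \rVert_{2}^{2}.
\end{aligned}
```

The perceptual and style terms would take the same form as the last line, with VGG16 features (and their Gram matrices for the style term) in place of the SunEst-CNN features.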

IV Experiments

Fig. 3: This figure is best viewed in color in the web version. The above are example outputs from the proposed Shadow Transfer method (second column) in comparison to the state-of-the-art baselines, ComboGAN (third column) and StarGAN (fourth column). Each row is the generated output image for a specified scene and light source location from the test set, given as a two element sun azimuth and zenith vector.

| Model Type | [-60.0, 60.0] | [80.0, 30.0] | [-80.0, 30.0] | [60.0, 60.0] | [0.0, 75.0] | [-95.0, 10.0] | [-15.0, 80.0] | [15.0, 80.0] | [95.0, 10.0] |
|---|---|---|---|---|---|---|---|---|---|
| Proposed Method | 0.770 | 0.704 | 0.819 | 0.721 | 0.770 | 0.803 | 0.773 | 0.769 | 0.810 |
| ComboGAN | 0.662 | 0.602 | 0.657 | 0.674 | 0.780 | 0.721 | 0.768 | 0.766 | 0.699 |
| StarGAN | 0.334 | 0.329 | 0.339 | 0.292 | 0.330 | 0.375 | 0.327 | 0.330 | 0.358 |

TABLE I: Comparisons to state of the art multi-domain to multi-domain transfer methods

| Depth Input | Sem. Seg. Input | Sun Est. Loss | [-60.0, 60.0] | [80.0, 30.0] | [-80.0, 30.0] | [60.0, 60.0] | [0.0, 75.0] | [-95.0, 10.0] | [-15.0, 80.0] | [15.0, 80.0] | [95.0, 10.0] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ? | ? | ? | 0.776 | 0.707 | 0.828 | 0.732 | 0.780 | 0.810 | 0.782 | 0.780 | 0.817 |
| ? | ? | ? | 0.770 | 0.704 | 0.819 | 0.721 | 0.770 | 0.803 | 0.773 | 0.769 | 0.810 |
| ? | ? | ? | 0.721 | 0.673 | 0.760 | 0.686 | 0.721 | 0.760 | 0.724 | 0.719 | 0.756 |
| ? | ? | ? | 0.762 | 0.686 | 0.803 | 0.714 | 0.763 | 0.792 | 0.765 | 0.762 | 0.799 |

TABLE II: Ablation experiments of different inputs and network components (Shadow Transfer Net Components) for CARLA-sun. The descriptor ‘Sun Est. Loss’ indicates if the Shadow Transfer network was trained with the novel SunEst-CNN feature loss. The descriptor ‘Depth Input’ indicates if the depth map was used as input. Similarly, the descriptor ‘Sem. Seg. Input’ indicates if the semantic segmentation map was used as input.

Fig. 4: This figure is best viewed in color in the web version of the paper. Shown above is the set of qualitative ablation experiments on the KITTI-sun dataset. The original RGB image, depth and semantic segmentation map inputs are provided in the first column in the blue dashed box. In the second column are the injected light source locations used to relight the original RGB image. The third through sixth columns are the outputs of the Shadow Transfer network when trained with the full proposed model, the proposed model with no SunEst-CNN feature loss, the proposed model with only depth input, and the proposed model with only semantic segmentation input, respectively.

To validate the efficacy of our method, we present qualitative and quantitative evaluations on both real and synthetic datasets. The proposed Shadow Transfer network is most similar in design and goal to Multi-Domain to Multi-Domain transfer methods. Therefore, we compare the performance of the proposed method to the two state-of-the-art models StarGAN [32] and ComboGAN [31].

Since our goal is to generate images taken under lighting conditions/sun positions that don’t necessarily exist in a given dataset, we cannot use quantitative metrics to validate the quality of generated images against ground truth for real image datasets. Therefore, we have designed a synthetic driving dataset using the CARLA rendering engine [39], which we refer to as CARLA-sun. Unlike real data, this dataset has a fixed number of light source locations, and contains the same scene under each possible lighting condition. To generate the dataset, we perform the same 100 vehicle trajectories under different sun positions defined by the sun azimuth and zenith angles. There are 9 total sun positions that correspond to the sun location at different times of day. There are a total of 13647 RGB, depth and semantic segmentation tuples; see the first row of Figure 3 for examples of the ground truth images. We also generated a held out test dataset of a single vehicle trajectory consisting of 424 images. We perform all synthetic experiments on this held out set.

To test the proposed method on real data, we present results using the KITTI raw dataset [40]. We use the same method as [25] to generate the sun position labels for each image in the dataset. We discretize the possible sun locations by rounding the azimuth and zenith to the nearest ten, which helps improve the latent space learning. To demonstrate the self-supervised nature of the method, we use a pretrained state-of-the-art monocular depth estimation network, Monodepth [41], to generate depth labels for each real image. Similarly, we use the state-of-the-art semantic segmentation network DeepLab [4], pretrained on Cityscapes [5], to generate coarse semantic segmentation labels for each real image. We refer to this dataset as KITTI-sun; it has a total of 22400 RGB, depth, semantic segmentation, and sun location tuples.

For the CARLA-sun synthetic dataset, we present ablation experiments to determine which of the Shadow Transfer network components yields the highest output image quality relative to the ground truth images. For KITTI-sun, we present qualitative examples to determine if the components contribute to the perceived visual quality of the output images.
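The discretization of KITTI-sun sun locations to the nearest ten degrees, described above, can be sketched as follows. The tie-breaking rule (halves round up) and the function names are assumptions made for the example; the paper does not specify them.

```python
import math


def round_to_nearest_ten(angle_deg):
    """Round an angle in degrees to the nearest multiple of 10 (ties round up)."""
    return 10.0 * math.floor(angle_deg / 10.0 + 0.5)


def discretize_sun(azimuth_deg, zenith_deg):
    """Map a continuous (azimuth, zenith) label onto the 10-degree grid."""
    return (round_to_nearest_ten(azimuth_deg), round_to_nearest_ten(zenith_deg))


print(discretize_sun(-57.3, 62.8))  # (-60.0, 60.0)
```

Python's built-in `round` resolves ties to the nearest even value, so an explicit floor-based rule is used here to keep the behaviour easy to reason about at bin boundaries.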

IV-A Training parameters and regimes

For both the synthetic and real datasets, we train the Shadow Transfer networks for 50 epochs with a learning rate of 2e-4 and a batch size of 2. To accommodate the requirements of the U-net encoder-decoder, the input images are resized to 512x512. The SunEst-CNN networks are initialized using the weights of a VGG-16 network trained to perform scene classification on Places-365. They are trained for 20 epochs at a learning rate of 1e-5 and a batch size of 2, with input images resized to 256x256 to accommodate the architecture requirements. The two state-of-the-art multi-domain to multi-domain transfer methods, ComboGAN and StarGAN, are trained on CARLA-sun using the training hyperparameters given in their papers and respective GitHub repositories. All networks were trained on a single Titan X GPU.

IV-B Evaluation on CARLA-sun

IV-B1 Comparison to Multi-Domain to Multi-Domain adaptation methods

We present a qualitative, visual comparison between the proposed method and state-of-the-art methods in Figure 3. ComboGAN is able to capture realistic lighting, specifically shadows and color temperature, for each light source condition. However, it introduces translation artifacts that degrade the overall photorealism of the output image. StarGAN is only able to realistically capture varying global color temperature between the different light source domains. In contrast, the proposed method is able to capture realistic shading, shadows and color temperature without introducing the same artifacts as ComboGAN. Note that the proposed method does not appear to match the brightness of the scene correctly in comparison to the ground truth, and also produces blur artifacts, as shown in the right hand side of Figure 3.

These observations are supported by our quantitative analysis of generated image quality relative to ground truth, presented for each light source location in Table I. For a held out test trajectory from the CARLA-sun dataset, we calculate the average mean structural similarity metric (MSSIM) for each light source location for the proposed method and the baselines. StarGAN has the lowest MSSIM at every light source location. We suspect this is because it was designed to operate on simpler scenes, specifically faces. Our model outperforms ComboGAN at eight of the nine light source locations; the exception is [0.0, 75.0], where Table I lists 0.780 for ComboGAN and 0.770 for the proposed method. As observed in Figure 3, we suspect that ComboGAN scores lower because it induces the same visual artifacts that plague the majority of image-to-image translation models trained to transform from RGB to RGB spaces. In contrast, the proposed method transforms from a geometric representation to an RGB representation, which yields a better mapping between visual features of the two spaces.
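For reference, a mean SSIM score of the kind reported in Table I can be computed along the following lines. This is a simplified sketch: it uses uniform sliding windows on grayscale images stored as nested lists, whereas standard SSIM implementations typically use Gaussian-weighted windows, and the paper does not state which implementation or window settings were used. The window size, stride, and function names are assumptions.

```python
def ssim_window(x, y, data_range=255.0):
    """SSIM between two equally sized flat lists of pixel values."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2   # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2   # stabilizes the contrast/structure term
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mssim(img_a, img_b, window=8, stride=4, data_range=255.0):
    """Mean SSIM over sliding square windows of two 2-D grayscale images."""
    height, width = len(img_a), len(img_a[0])
    scores = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            rows = range(top, top + window)
            cols = range(left, left + window)
            patch_a = [img_a[r][c] for r in rows for c in cols]
            patch_b = [img_b[r][c] for r in rows for c in cols]
            scores.append(ssim_window(patch_a, patch_b, data_range))
    return sum(scores) / len(scores)


# Toy 16x16 gradient image and a darkened copy of it.
image = [[float((16 * r + c) % 256) for c in range(16)] for r in range(16)]
darker = [[0.5 * v for v in row] for row in image]
print(mssim(image, image), mssim(image, darker))
```

A score of 1 means the two images are identical within every window; changes in local brightness, contrast, or structure each pull the score down, which is why both blur and incorrect scene brightness affect the numbers in Table I.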

IV-B2 Ablation Experiments

To evaluate each component of our network, we calculate the mean MSSIM between the prediction and ground truth CARLA-sun image for each possible light source location on the same held out sequence used above; the results are given in Table II. Interestingly, it appears that having both the depth and segmentation map as input is more important to the predicted image quality than the SunEst-CNN feature loss. This suggests that the lighting conditions within the CARLA gaming system are not complex enough for the Shadow Transfer framework to need the extra constraint of the SunEst-CNN loss in order to model them accurately.

IV-C Evaluation on KITTI-sun

Since we are missing ground truth scenes for the relit real images, in this section we present a qualitative evaluation of the relit image quality. A set of qualitative ablation experiments is presented in Figure 4. When comparing the relit images generated by the full model to those of the model trained without the SunEst-CNN feature loss, we see that the latter shows no illumination feature transfer. This highlights the importance of the illumination-based SunEst-CNN feature loss during the training procedure for real data. Both the depth-only and semantic segmentation-only models are noticeably more noisy and blurry. The segmentation-only model appears to better capture the changing shadow distributions throughout the scene and illumination features in general, whereas the depth-only model appears to better capture realistic pixel statistics. In general, we observe that the proposed Shadow Transfer network has difficulty capturing local, high frequency noise. We suspect that this is due to the coarseness of the input depth and semantic segmentation masks.

V Discussion and Conclusions

In this work, we proposed a single image relighting framework that is able to successfully transfer illumination effects for both synthetic and real images. Our results indicate that the proposed Shadow Transfer framework generates more realistic images than the state-of-the-art multi-domain to multi-domain transfer methods. Future work would apply advancements from the super pixel literature to this architecture to improve the high frequency noise modeling in the generated images. Another extension of this work would be to incorporate weather and imaging pipeline models into this framework to better capture the myriad influences of illumination on image appearance.

References

  • [1] N. Alshammari, S. Akcay, and T. P. Breckon, “On the impact of illumination-invariant image pre-transformation for contemporary automotive semantic scene understanding,” in 2018 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2018, pp. 1027–1032.
  • [2] F. Bahri, M. Shakeri, and N. Ray, “Online illumination invariant moving object detection by generative neural network,” arXiv preprint arXiv:1808.01066, 2018.
  • [3] M. S. Ramanagopal, C. Anderson, R. Vasudevan, and M. Johnson-Roberson, “Failing to learn: Autonomously identifying perception failures for self-driving cars,” CoRR, vol. abs/1707.00051, 2017. [Online]. Available: http://arxiv.org/abs/1707.00051
  • [4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
  • [5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223.
  • [6] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 year, 1000 km: The oxford robotcar dataset,” The International Journal of Robotics Research, vol. 36, no. 1, pp. 3–15, 2017.
  • [7] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: Ground truth from computer games,” in European Conference on Computer Vision (ECCV), ser. LNCS, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., vol. 9906.   Springer International Publishing, 2016, pp. 102–118.
  • [8] M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, and R. Vasudevan, “Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks?” arXiv preprint arXiv:1610.01983, 2016.
  • [9] J. Philip, M. Gharbi, T. Zhou, A. A. Efros, and G. Drettakis, “Multi-view relighting using a geometry-aware network,” ACM Transactions on Graphics (TOG), vol. 38, no. 4, p. 78, 2019.
  • [10] Z. Xu, K. Sunkavalli, S. Hadap, and R. Ramamoorthi, “Deep image-based relighting from optimal sparse samples,” ACM Transactions on Graphics (TOG), vol. 37, no. 4, p. 126, 2018.
  • [11] J. Chen, J. Konrad, and P. Ishwar, “A cyclically-trained adversarial network for invariant representation learning,” arXiv preprint arXiv:1906.09313, 2019.
  • [12] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow, “Interpretable transformations with encoder-decoder networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5726–5735.
  • [13] M. Wang, Z. Shu, S. Cheng, Y. Panagakis, D. Samaras, and S. Zafeiriou, “An adversarial neuro-tensorial approach for learning disentangled representations,” International Journal of Computer Vision, vol. 127, no. 6-7, pp. 743–762, 2019.
  • [14] J. T. Barron and J. Malik, “Intrinsic scene properties from a single rgb-d image,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 17–24.
  • [15] ——, “Shape, illumination, and reflectance from shading,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 8, pp. 1670–1687, 2014.
  • [16] C. Donahue, Z. C. Lipton, A. Balsubramani, and J. McAuley, “Semantically decomposing the latent spaces of generative adversarial networks,” arXiv preprint arXiv:1705.07904, 2017.
  • [17] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum, “Deep convolutional inverse graphics network,” in Advances in neural information processing systems, 2015, pp. 2539–2547.
  • [18] Q. Zheng, X. Qiao, Y. Cao, and R. W. Lau, “Distraction-aware shadow detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5167–5176.
  • [19] J. Wang, X. Li, and J. Yang, “Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1788–1797.
  • [20] L. Qu, J. Tian, S. He, Y. Tang, and R. W. Lau, “Deshadownet: A multi-context embedding deep network for shadow removal,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4067–4075.
  • [21] Y. Xiao, E. Tsougenis, and C.-K. Tang, “Shadow removal from single rgb-d images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3011–3018.
  • [22] S. Zhang, R. Liang, and M. Wang, “Shadowgan: Shadow synthesis for virtual objects with conditional adversarial networks,” Computational Visual Media, vol. 5, no. 1, pp. 105–115, 2019.
  • [23] H. A. Alhaija, S. K. Mustikovela, A. Geiger, and C. Rother, “Geometric image synthesis,” in Asian Conference on Computer Vision.   Springer, 2018, pp. 85–100.
  • [24] J.-F. Lalonde, A. A. Efros, and S. G. Narasimhan, “Estimating the natural illumination conditions from a single outdoor image,” International Journal of Computer Vision, vol. 98, no. 2, pp. 123–145, 2012.
  • [25] V. Peretroukhin, L. Clement, and J. Kelly, “Reducing drift in visual odometry by inferring sun direction using a bayesian convolutional neural network,” in 2017 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2017, pp. 2035–2042.
  • [26] W.-C. Ma, S. Wang, M. A. Brubaker, S. Fidler, and R. Urtasun, “Find your way by observing the sun and other semantic cues,” in 2017 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2017, pp. 6292–6299.
  • [27] L. Clement, V. Peretroukhin, and J. Kelly, “Improving the accuracy of stereo visual odometry using visual illumination estimation,” in International Symposium on Experimental Robotics.   Springer, 2016, pp. 409–419.
  • [28] X. Sun, K. Zhou, Y. Chen, S. Lin, J. Shi, and B. Guo, “Interactive relighting with dynamic brdfs,” ACM Transactions on Graphics (TOG), vol. 26, no. 3, p. 27, 2007.
  • [29] A. Anoosheh, T. Sattler, R. Timofte, M. Pollefeys, and L. Van Gool, “Night-to-day image translation for retrieval-based localization,” arXiv preprint arXiv:1809.09767, 2018.
  • [30] D. Sakkos, E. S. Ho, and H. P. Shum, “Illumination-aware multi-task gans for foreground segmentation,” IEEE Access, vol. 7, pp. 10 976–10 986, 2019.
  • [31] A. Anoosheh, E. Agustsson, R. Timofte, and L. Van Gool, “Combogan: Unrestrained scalability for image domain translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 783–790.
  • [32] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “Stargan: Unified generative adversarial networks for multi-domain image-to-image translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797.
  • [33] E. A. Khan and E. Reinhard, “Evaluation of color spaces for edge classification in outdoor scenes,” in IEEE International Conference on Image Processing 2005, vol. 3.   IEEE, 2005, pp. III–952.
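  • [34] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2015.
  • [35] R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in European Conference on Computer Vision.   Springer, 2016.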
  • [36] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in European Conference on Computer Vision.   Springer, 2016, pp. 694–711.
  • [37] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
  • [38] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on.   IEEE, 2009, pp. 248–255.
  • [39] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla: An open urban driving simulator,” arXiv preprint arXiv:1711.03938, 2017.
  • [40] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.   IEEE, 2012, pp. 3354–3361.
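  • [41] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.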