Spatially-Varying Outdoor Lighting Estimation from Intrinsics

04/09/2021 · Yongjie Zhu, et al.

We present SOLID-Net, a neural network for spatially-varying outdoor lighting estimation at any 2D pixel location from a single outdoor image. Previous work has used a unified sky environment map to represent outdoor lighting. Instead, we generate spatially-varying local lighting environment maps by combining the global sky environment map with warped image information according to geometry estimated from intrinsics. As no outdoor dataset with images and local lighting ground truth is readily available, we introduce the SOLID-Img dataset with physically-based rendered images and their corresponding intrinsic and lighting information. We train a deep neural network to regress intrinsic cues with physically-based constraints and use them to conduct global and local lighting estimation. Experiments on both synthetic and real datasets show that SOLID-Net significantly outperforms previous methods.


1 Introduction

Estimating outdoor lighting from a single image is one of the fundamental problems in computer vision. By providing outdoor scene properties from the physical aspect, it has a huge impact on many applications, e.g., face/body relighting, scene understanding, augmented reality (AR), and so on. This task is rather challenging since images are formed by conflating lighting with complex surface reflectance distributions and object geometry. In the outdoor scenario, existing solutions usually employ low-dimensional parametric models such as the Hošek-Wilkie (HW) sky model [outdoorlighting_param] with four parameters to fit the sky illumination. The capacity of parametric models is not sufficient to represent complex real-world illumination; a recent non-parametric approach, which uses an autoencoder to learn the sky illumination model from a large-scale sky panorama dataset and encodes the lighting information from a single limited field-of-view (FOV) image, shows more promising results [deepsky].

Figure 1: Given a single low-dynamic-range (LDR) image with limited FOV and a location in pixel coordinates (marked by numbers), SOLID-Net, for the first time, infers a panoramic HDR illumination map representing the light arriving from all directions at that location. Note that the global environment map (which could be estimated using an existing method [deepsky]) only covers a small part of the local lighting (red contours).

However, as far as we know, all existing outdoor lighting estimation methods [outdoorlighting_param, deepsky, allweather] consider outdoor illumination only as a single global map without any spatially-varying consideration, i.e., the light probe is surrounded by an environment map that casts rays from infinitely far away. Spatially-varying lighting estimation has proved successful in indoor scenarios, where it is achieved by modeling local indoor lighting with low-frequency parametric lighting represented by spherical harmonics (SH) [indoorlocallighting_SH, Barron2013] or with a panoramic environment map [indoorlocallighting_pano].

Extending spatially-varying lighting estimation from indoor to outdoor scenes is non-trivial in three aspects: 1) The extremely high-dynamic-range (HDR) sunlight and the complicated sky light under different weather conditions make outdoor lighting more difficult to parameterize than indoor lighting [indoorlocallighting_SH, Barron2013], while the existing non-parametric sky model [deepsky] treats it as a pure deep learning task without considering physics-based image formation constraints. 2) Non-parametric spatially-varying local lighting estimation is highly ill-posed, since different 3D locations should have different lighting observations and the majority of each local observation is missing [indoorlocallighting_pano]. 3) HDR and panoramic images capturing local lighting and geometry information in outdoor scenes are not yet available, even though many datasets exist for this purpose in the indoor scenario, synthetically generated from SUNCG [suncg] and Matterport3D [Matterport3D].

In this paper, we propose SOLID-Net, a neural network for Spatially-varying Outdoor Lighting estimation using cues from Intrinsic image Decomposition, as shown in Figure 1. We tackle the three major challenges mentioned above by proposing a two-stage framework: 1) We train a single-in-multi-output CNN to decompose an input image into intrinsic parts: albedo (material-related), normal and plane distance (geometry-related), and shadow (lighting-related). These intrinsics provide a physically-based shading constraint by fitting SH-represented global lighting with low-frequency information, which is then combined with sky features extracted from the input image to generate a non-parametric sky model like [deepsky]. 2) With the geometry estimated from the decomposed intrinsics, we further warp the limited-FOV input image and estimated shadow map to a spherical projection centered at the target location, which provides a panoramic observation to alleviate the ill-posedness. This is then combined with the global sky lighting from the previous step as input to train a multi-input-single-output CNN that completes the high-frequency local lighting estimation. 3) We use the Blender SceneCity [scenecity] to create city models that contain a large set of outdoor scenes and render a synthetic outdoor lighting estimation dataset with labeled location information and corresponding lighting effects using a physically-based path tracer to facilitate the training of our network. SOLID-Net demonstrates significant improvements over other methods by making contributions in

  • integrating shading constraint from intrinsic decomposition into the global sky lighting estimation;

  • producing high-frequency local lighting estimation via panoramic warping and shadow map reference; and

  • building the first spatially-varying outdoor lighting estimation dataset with ground truth labels.

Figure 2: Pipeline of data generation and filtering for creating SOLID-Img dataset with physically based rendering.

2 Related Work

Outdoor lighting estimation. Stumpfel et al. [paul04] proposed to explicitly capture HDR outdoor lighting environments that include the sun and sky with multiple exposures. Lalonde et al. [lalonde12] first proposed lighting estimation from a single, generic outdoor scene. Their approach relied on multiple cues (such as shadows, shading, and sky appearance variation) extracted from the image. There are solutions using parametric models to represent outdoor lighting: Cheng et al. [cheng2018] estimated lighting from the front and back cameras of a mobile phone. However, they represented lighting using low-frequency SH, which does not appropriately model outdoor lighting. Hold-Geoffroy et al. [outdoorlighting_param] learned to estimate Hošek-Wilkie (HW) sky model parameters from a single image, which was further extended by Zhang et al. [allweather] with a more flexible parametric Lalonde-Matthews (LM) sky model. To include more information about the sky, Hold-Geoffroy et al. [deepsky] designed an autoencoder to learn a non-parametric sky model from a large sky panorama dataset [skydatabase] and trained a network to learn the sky lighting from limited-FOV images. LeGendre et al. [legendre19] used a mobile phone camera with three different reflective spheres to capture lighting ground truth and used these data to train their deep model effectively, but these spheres are still global lighting probes.

Local lighting estimation. A direct way of obtaining the local lighting of an environment is to capture the lighting intensity at a target location using a probe of known shape. Debevec et al. [paul98] showed that HDR environment maps can be captured with a reflective metallic sphere captured together with the scene. Barron and Malik [Barron2013] decomposed the scene into intrinsic components including spatially-varying SH-based lighting, but their method required an RGBD image as input and relied on hand-crafted priors. To learn local lighting representation, Garon et al. [indoorlocallighting_SH] predicted fifth-order SH coefficients from an input image and local patches with synthetic data. In more recent progress, Li et al. [indoorlighting_inverserender] proposed a dense spherical Gaussian lighting representation with differentiable rendering to conduct scene editing. However, all the methods mentioned above only considered indoor parametric lighting and are difficult to extend to outdoor lighting. Song et al. [indoorlocallighting_pano] proposed a cascaded model (denoted as NeurIllum for brevity) to recover high-frequency local lighting with a warped color image according to the recovered geometry, which showed promising texture details, but the lighting positions are sometimes less accurate due to the large amount of missing information in the panorama.

3 Dataset

A large dataset containing HDR images and their corresponding illumination measured at different locations in a scene is required to learn to estimate outdoor intrinsics and local lightings. Existing outdoor panorama datasets, such as [deepsky, ldr2hdr], only provide a single global illumination map assuming distant lighting, which cannot be used to learn local lighting estimation. To provide training data for solving the “SOLID” problem, we introduce SOLID-Img, a dataset for Spatially-varying Outdoor Lighting estimation with ground truth Intrinsic Decomposition labels and a large amount of rendered Images, as shown in Figure 2.

3.1 Data Generation

We adopt 3D city models from the Blender SceneCity [scenecity] to create synthetic scenes. In Blender SceneCity, there are 450 unique objects in 80 material categories. The object models provide surface materials, including diffuse albedo, roughness, and transparency, which are used to obtain photo-realistic renderings.

Camera setting. For each road block, we select a set of cameras with diverse views that see most objects in the context, to provide comprehensive information for lighting estimation, as shown in Figure 2(a). Our process starts by selecting the “best” camera [physicallyrender] for each of the six horizontal view direction sectors in every road block. For each sector, we select the view with the highest percentage of pixel coverage according to the item buffer, as long as it sees more than three object categories (more details are in the supplementary material).
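A minimal sketch of this selection heuristic is given below; the `item_buffer` representation and the `categories` field are assumptions for illustration, not the authors' exact data structures.

```python
import numpy as np

def select_best_camera(candidate_views, min_categories=3):
    """Pick the view with the highest non-background pixel coverage,
    keeping only views that see more than `min_categories` object categories.

    Each candidate is assumed to be a dict with an 'item_buffer' (H, W) array
    of per-pixel object IDs (0 = background/sky) and a 'categories' set of
    object categories visible in the view.
    """
    best_view, best_coverage = None, -1.0
    for view in candidate_views:
        if len(view["categories"]) <= min_categories:
            continue  # require more than three object categories
        item_buffer = view["item_buffer"]
        coverage = np.count_nonzero(item_buffer) / item_buffer.size
        if coverage > best_coverage:
            best_view, best_coverage = view, coverage
    return best_view  # None if no candidate satisfies the category constraint
```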

Scene rendering. We collect 70 HDR environment maps from HDRI Haven [hdrihaven], which cover different solar zenith angles from sunrise to sunset. To simulate different sunlight directions, we rotate each HDR environment map along the latitude direction by a uniformly sampled random angle. We then render images using the camera settings above and these HDR environment maps with the physically-based Blender Cycles rendering engine [blender] to generate photo-realistic renderings, at a fixed resolution with a physically-based path tracer using 512 samples. We record the material buffers (diffuse albedo buffer, normal buffer, depth buffer) as intermediate ground truth. We represent 3D geometry using the surface normal and plane distance, and render both as suggested in [im2pano3d]. To render shadows, we set the whole scene to a single Lambertian material and render it twice, with shadows turned on and off respectively, from which shadow maps are calculated by taking the difference, as shown in Figure 2(b).

Local lighting collections. To obtain the ground truth of global lighting, we save the rotated environment maps. To collect local lighting, we randomly sample 4 locations in the scene and render 4 local light probes. The image is split into 4 quadrants, and a random 2D coordinate is sampled uniformly in each quadrant (excluding the sky part and the 5% of pixels near the image boundary). The 3D centers of the local cameras are calculated by casting a ray from the camera recording the scene to the surface of the scene and taking the first intersection point. From that point, we offset the local camera center by a fixed distance (in centimeters) along the plane surface normal to prevent large invalid pixel regions, and render a local light probe at this position. All local light probes are rendered in the equirectangular representation, as shown in Figure 2(c).
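The probe-placement step can be sketched as follows; the ray-casting helper `cast_ray` and the offset argument `offset_cm` are hypothetical stand-ins, since the exact offset value is not given in the text above.

```python
import numpy as np

def place_local_probe(camera_center, pixel_ray_dir, cast_ray, offset_cm):
    """Compute the 3D center of a local light probe.

    cast_ray(origin, direction) is assumed to return the first scene
    intersection point and the surface normal of the supporting plane.
    The probe is moved `offset_cm` centimeters along that normal to avoid
    rendering large invalid regions right at the surface.
    """
    hit_point, plane_normal = cast_ray(camera_center, pixel_ray_dir)
    plane_normal = plane_normal / np.linalg.norm(plane_normal)
    probe_center = hit_point + (offset_cm / 100.0) * plane_normal  # meters
    return probe_center
```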

3.2 Data Filtering

Inspired by [physicallyrender], we remove low-quality renderings whose color distributions differ from those of natural images, e.g., renderings with overly low or high intensities. To obtain a prior color distribution on real images, we compute normalized color histograms for 1100 selected real images from the Google Street View Dataset [googlestreetview]. For each rendered image, we calculate the histogram similarity to a real image as the sum of the minimal value of each bin; we then assign it a score computed as the largest histogram similarity over all real images; finally, we select all images with a color similarity score larger than a fixed threshold, as shown in Figure 2(d). This process selects 38000 images from the initially rendered image set, composing the SOLID-Img dataset. The dataset is then carefully split into train/test sets according to different lighting conditions.
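The filtering criterion can be sketched as below, assuming histogram intersection over normalized RGB histograms; the bin count and the final threshold are placeholders, as the text above does not specify them.

```python
import numpy as np

def color_histogram(image, bins=32):
    """Normalized per-channel color histogram, concatenated over RGB."""
    hist = [np.histogram(image[..., c], bins=bins, range=(0.0, 1.0))[0]
            for c in range(3)]
    hist = np.concatenate(hist).astype(np.float64)
    return hist / hist.sum()

def histogram_similarity(h1, h2):
    """Histogram intersection: sum of the minimal value of each bin."""
    return np.minimum(h1, h2).sum()

def similarity_score(rendered_image, real_histograms):
    """Score a rendering by its best match against all real-image histograms."""
    h = color_histogram(rendered_image)
    return max(histogram_similarity(h, hr) for hr in real_histograms)

# A rendering is kept only if similarity_score(...) exceeds the (unspecified
# here) color-similarity threshold used in the paper.
```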

Figure 3: Pipeline of SOLID-Net. Stage 1: I-Net takes the input image and estimates the intrinsic parts (albedo, normal, plane distance, and shadow) together with intermediate SH coefficients, which are decoded with sky features to generate the global sky environment map. The recovered geometry from the normal and plane distance is used to warp the input image and the estimated shadow map into panoramic images around an input location. Stage 2: P-Net takes in the warped images and the global sky lighting to predict high-frequency HDR spatially-varying lighting. The whole network is trained in an end-to-end manner.

4 Method

This section introduces the design methodology of SOLID-Net, whose pipeline is shown in Figure 3. It is a two-stage framework that learns to reconstruct local HDR outdoor environment maps, trained with the SOLID-Img dataset introduced in Section 3.

4.1 Problem Formulation

We formulate illumination estimation as a regression problem. Given an LDR image with limited FOV and a selected pixel location in homogeneous coordinates, our model outputs an HDR illumination map centered around the 3D location of that pixel and a global sky environment HDR illumination map, both represented as panoramic images with full FOV.

4.2 Network Architecture

A straightforward approach to estimating outdoor lighting from the scene would be to simply take the single limited-FOV image as input, encode it into a feature map using a CNN, and feed the feature map into a lighting-regression sub-network [lalonde17, deepsky]. Unsurprisingly, we find that this results in higher outdoor lighting estimation error (see Figure 6), presumably because it is difficult for the network to learn how to extract full-FOV lighting from a limited-FOV image. One way to improve this is to bring in regularization from the Lambertian rendering equation [inverserendernet], which is however challenging for outdoor spatially-varying lighting estimation because: 1) outdoor scenes have large areas of shadow occlusion that cannot be directly fitted by the Lambertian model, and 2) SH lighting has a limited dynamic range and is too smooth to represent sharp sky lighting and detailed texture. Therefore, we propose a two-stage framework to jointly solve these problems by: 1) proposing an intrinsic image decomposition network (denoted as I-Net) that takes a limited-FOV image as input and estimates its intrinsic components as well as a global sky environment map, and 2) designing a panoramic completion module (denoted as P-Net) that estimates local lighting from the outputs of the previous stage and the input location.

Figure 4: An example of intrinsic decomposition results using our SOLID-Img test dataset. Given an input image, our estimated albedo, normal, plane distance, shadow, and shading show close appearance to the ground truth (shown as insets).
Figure 5: An example of panoramic warping. By using the estimated geometry-related intrinsics, we warp the observed image into panorama coordinates according to the input pixel location. (Please zoom-in for details.)

I-Net. As shown by the blue blocks in Figure 3, I-Net takes a single limited-FOV LDR image as input and produces multiple outputs, including the diffuse albedo, surface normal, plane distance map [im2pano3d], shadow map, second-order SH coefficients, and the sky environment map generated from the SH coefficients and sky features. We use a single encoder to capture global features of intrinsic information, and then use five decoders for these intrinsic outputs, followed by a decoder lighting branch for sky environment map regression. Skip links are used for preserving details. In particular, for the sky map regression, we use a fully-connected (FC) layer to process the output feature maps of the lighting branch encoder and generate a latent vector of size 27 (second-order SH in RGB). For the decoder, we reshape this vector and upsample it 8 times, and then combine it with flip-padded sky features to generate a 256×128 HDR sky environment map. The lighting information encoded in the SH coefficients can be considered as the low-frequency form of the sky environment map, and it is used to guide the recovery of the high-frequency sky environment map with sky features extracted from the input images. In summary, I-Net predicts intrinsic components (examples are provided in Figure 4) and the sky environment map:

(albedo, normal, plane distance, shadow, SH coefficients, sky environment map) = I-Net(input image)    (1)
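A rough PyTorch sketch of how such a sky-regression head could be wired is shown below; the layer sizes, the bilinear upsampling, and the fusion by concatenation are our assumptions standing in for the "8 times" upsampling and flip-padded sky features described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkyDecoderHead(nn.Module):
    """Sketch of the sky-regression head: pooled lighting-branch features ->
    27-D SH latent -> upsampled lighting map, fused with sky features from the
    input image. Layer sizes are illustrative, not the paper's exact design."""

    def __init__(self, feat_channels=256, sky_feat_channels=64):
        super().__init__()
        self.fc = nn.Linear(feat_channels, 27)  # 2nd-order SH in RGB (9 x 3)
        self.fuse = nn.Sequential(
            nn.Conv2d(3 + sky_feat_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),  # HDR sky environment map (RGB)
        )

    def forward(self, lighting_feat, sky_feat):
        # lighting_feat: (B, C) pooled encoder features of the lighting branch
        # sky_feat:      (B, C_sky, 128, 256) features extracted from the input
        sh = self.fc(lighting_feat)                    # (B, 27) SH latent
        sh_map = sh.view(-1, 3, 3, 3)                  # reshape to a tiny map
        sh_map = F.interpolate(sh_map, size=(128, 256), mode="bilinear",
                               align_corners=False)    # upsample to sky size
        sky = self.fuse(torch.cat([sh_map, sky_feat], dim=1))
        return sky, sh
```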

P-Net. The 3D location of each pixel is calculated from the normal vector and plane distance predicted by I-Net. With the camera intrinsic matrix fixed and the 2D pixel locations of the whole image provided, we can reproject the pixels into a 3D scene by back-projecting each pixel along its viewing ray to the depth determined by the predicted normal and plane distance.
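A numpy sketch of this unprojection, under our reading that the per-pixel depth is the distance along each viewing ray to the plane defined by the predicted normal and plane distance (sign conventions are assumptions):

```python
import numpy as np

def unproject_pixels(K, normals, plane_dist):
    """Reproject all pixels to 3D using fixed intrinsics K and the
    geometry-related intrinsics (surface normal, plane distance).

    normals:    (H, W, 3) unit surface normals in camera coordinates
    plane_dist: (H, W)    distance from the camera center to the local plane
    Returns:    (H, W, 3) 3D points in camera coordinates.
    """
    H, W = plane_dist.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix_h = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix_h @ np.linalg.inv(K).T            # (H, W, 3) viewing rays
    # Depth along each ray such that the point lies on the local plane.
    denom = np.sum(normals * rays, axis=-1)
    depth = plane_dist / np.clip(np.abs(denom), 1e-6, None)
    return rays * depth[..., None]
```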

Using this geometry, we warp the input image and the estimated shadow map and spatially align them with the output local lighting to provide a panoramic observation. First, we compute the local camera location according to the input point position and apply the same translation along the normal direction of the supporting plane as used for the training data (defined in Section 3). Second, we perform a panoramic warping through a forward projection using the estimated geometry and camera location to map the pixels of the input image and shadow map into panoramic images (an example is provided in Figure 5). A Z-buffer is computed to discard invisible points, and points without projected positions are set to 0.
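A simplified sketch of this forward warping into an equirectangular panorama with a Z-buffer is given below; the axis conventions and panorama size are assumptions.

```python
import numpy as np

def warp_to_panorama(points_3d, colors, probe_center, pano_h=128, pano_w=256):
    """Forward-project per-pixel 3D points into an equirectangular panorama
    centered at `probe_center`, keeping the nearest point per panorama pixel
    (a simple Z-buffer); unfilled pixels stay 0, as in the paper.
    """
    pano = np.zeros((pano_h, pano_w, colors.shape[-1]), dtype=np.float64)
    zbuf = np.full((pano_h, pano_w), np.inf)

    d = points_3d.reshape(-1, 3) - probe_center
    c = colors.reshape(-1, colors.shape[-1])
    r = np.linalg.norm(d, axis=-1)
    valid = r > 1e-6
    d, c, r = d[valid], c[valid], r[valid]

    theta = np.arctan2(d[:, 0], d[:, 2])          # azimuth in [-pi, pi]
    phi = np.arcsin(np.clip(d[:, 1] / r, -1, 1))  # elevation in [-pi/2, pi/2]
    u = ((theta + np.pi) / (2 * np.pi) * pano_w).astype(int) % pano_w
    v = ((phi + np.pi / 2) / np.pi * pano_h).astype(int).clip(0, pano_h - 1)

    for ui, vi, ri, ci in zip(u, v, r, c):
        if ri < zbuf[vi, ui]:        # Z-buffer: keep the closest surface point
            zbuf[vi, ui] = ri
            pano[vi, ui] = ci
    return pano
```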

Since the local lightings share the same camera rotation, the sky parts of the local lighting should be consistent; this motivates us to take the estimated sky as an input to P-Net. As shown by the orange blocks in Figure 3, P-Net concatenates the two incomplete panoramic images (warped input and warped shadow map) and the global lighting estimated by I-Net as inputs, and outputs a dense pixel-wise prediction of the local lighting panorama with full FOV and high-frequency details:

local lighting panorama = P-Net(warped input panorama, warped shadow panorama, global sky environment map)    (2)

P-Net is implemented as a fully convolutional U-Net [unet].

4.3 Loss Functions

Direct supervision loss. Direct supervision for I-Net is provided on 1) diffuse albedo predictions, 2) shadow predictions, 3) surface normal predictions (via a cosine loss), 4) plane distance map predictions, and 5) sky environment map predictions. Direct supervision for P-Net is then provided on the local lighting predictions.

(3)
(4)

where the hatted quantities denote the estimations of I-Net, and the dot product is taken for each vector in the matrices.
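The direct supervision terms can be sketched as below; since the specific norms are not stated above, the L1/L2 choices here are assumptions, while the cosine term for the normals follows the text.

```python
import torch
import torch.nn.functional as F

def direct_supervision_losses(pred, gt):
    """Direct supervision terms for I-Net and P-Net predictions.

    `pred` and `gt` are dicts of tensors; the L1/L2 norms are illustrative
    assumptions, the cosine term for normals follows the paper's description.
    """
    l_albedo = F.l1_loss(pred["albedo"], gt["albedo"])
    l_shadow = F.l1_loss(pred["shadow"], gt["shadow"])
    l_normal = (1.0 - F.cosine_similarity(pred["normal"], gt["normal"], dim=1)).mean()
    l_pdist  = F.l1_loss(pred["plane_dist"], gt["plane_dist"])
    l_sky    = F.mse_loss(pred["sky_env"], gt["sky_env"])
    l_local  = F.mse_loss(pred["local_env"], gt["local_env"])  # P-Net term
    return l_albedo + l_shadow + l_normal + l_pdist + l_sky, l_local
```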

Diffuse convolution loss. To guide the sky environment map estimated by I-Net to extract low-frequency lighting information from the encoded SH coefficients, we add a diffuse convolution loss that forces the estimated sky environment map, after diffuse convolution, to have a close appearance to a pure Lambertian surface relit by the SH lighting:

(5)

where the loss involves the global diffuse albedo, the SH coefficients obtained by reshaping the predicted latent vector, the normal map of a sphere in panorama coordinates, and the second-order SH basis evaluated on that normal map; the diffuse convolution function is defined as

(6)

where the sum is taken over the hemisphere centered at the pixel on the global lighting environment map, weighted according to the normal vector at that pixel and normalized by the sum of solid angles on the hemisphere; each panorama pixel contributes along its unit direction with a solid angle that carries a latitude-dependent scale factor (because pixels of the panorama map at different latitudes correspond to projections on the unit sphere with different areas).
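A direct, unoptimized numpy sketch of the diffuse convolution over an equirectangular panorama follows; the cosine weighting in the numerator is a standard assumption, and the normalization by the sum of solid angles follows the description above.

```python
import numpy as np

def panorama_directions_and_solid_angles(h, w):
    """Per-pixel unit directions and solid angles of an equirectangular map.
    Pixels near the poles cover a smaller area on the unit sphere, hence the
    cos(latitude) scale factor."""
    v, u = np.meshgrid(np.arange(h) + 0.5, np.arange(w) + 0.5, indexing="ij")
    phi = (v / h - 0.5) * np.pi            # latitude  in (-pi/2, pi/2)
    theta = (u / w) * 2 * np.pi - np.pi    # longitude in (-pi, pi)
    dirs = np.stack([np.cos(phi) * np.sin(theta),
                     np.sin(phi),
                     np.cos(phi) * np.cos(theta)], axis=-1)
    solid_angle = (2 * np.pi / w) * (np.pi / h) * np.cos(phi)
    return dirs, solid_angle

def diffuse_convolve(env_map, normal):
    """Cosine-weighted hemispherical average of an HDR panorama (H, W, 3) for
    a single unit normal, normalized by the hemisphere's total solid angle."""
    h, w, _ = env_map.shape
    dirs, dw = panorama_directions_and_solid_angles(h, w)
    cos_term = np.clip(dirs @ normal, 0.0, None)       # hemisphere mask
    numer = (env_map * (cos_term * dw)[..., None]).sum(axis=(0, 1))
    denom = dw[cos_term > 0].sum() + 1e-8              # sum of solid angles
    return numer / denom
```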

Inverse rendering reconstruction loss. To make the network learn constraints from the physically-based image formation model, we treat the SH coefficients as an intermediate variable and provide indirect supervision via an inverse rendering reconstruction loss on the directly illuminated part, multiplying by a non-shadowed mask to disregard the effect of shadows:

(7)

where the element-wise product with the non-shadowed mask is applied and a fixed gamma is used to compress the dynamic range. The non-shadowed mask is computed using the shadow maps from intrinsics, with a binary Otsu segmentation on the histogram of the shadow maps further used to eliminate weak interreflections; the loss compares the RGB image with the product of the estimated diffuse albedo and the shading obtained by applying the SH basis to the normal map.
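A hedged PyTorch sketch of this masked Lambertian reconstruction term is given below; the exact gamma value, the squared-error form, and the real SH basis constants are illustrative assumptions.

```python
import torch

def sh_basis_order2(normals):
    """Second-order (9-term) real SH basis evaluated at unit normals (B, 3, H, W)."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    ones = torch.ones_like(x)
    basis = [0.282095 * ones,
             0.488603 * y, 0.488603 * z, 0.488603 * x,
             1.092548 * x * y, 1.092548 * y * z,
             0.315392 * (3 * z * z - 1),
             1.092548 * x * z,
             0.546274 * (x * x - y * y)]
    return torch.stack(basis, dim=1)  # (B, 9, H, W)

def inverse_rendering_loss(image, albedo, normals, sh_coeffs, nonshadow_mask,
                           gamma=2.2):
    """Masked Lambertian reconstruction: compare the input image against
    albedo * (SH shading) on non-shadowed pixels, after gamma compression.
    sh_coeffs is expected as (B, 27); gamma here is an illustrative value."""
    basis = sh_basis_order2(normals)                    # (B, 9, H, W)
    coeffs = sh_coeffs.view(-1, 3, 9)                   # RGB x 9
    shading = torch.einsum("bkhw,bck->bchw", basis, coeffs).clamp(min=0)
    render = albedo * shading
    diff = (render.clamp(min=0) ** (1.0 / gamma)
            - image.clamp(min=0) ** (1.0 / gamma))
    return ((nonshadow_mask * diff) ** 2).mean()
```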

Tonemapped SSIM loss. A structural similarity index measure (SSIM) loss between dynamic-range-compressed images with a fixed gamma parameter is used to recover the structural similarity between the estimation and the ground truth:

(8)

where the exposure intensity is fixed in our experiments.
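A minimal sketch of the tonemapped SSIM loss, assuming a third-party SSIM implementation (pytorch-msssim) and placeholder exposure/gamma constants:

```python
import torch
from pytorch_msssim import ssim  # assumed third-party SSIM implementation

def tonemapped_ssim_loss(pred_hdr, gt_hdr, exposure=1.0, gamma=2.2):
    """SSIM between tone-mapped (gamma-compressed) HDR panoramas; `exposure`
    and `gamma` stand in for the fixed constants used in the paper."""
    def tonemap(x):
        return (exposure * x.clamp(min=0)) ** (1.0 / gamma)
    return 1.0 - ssim(tonemap(pred_hdr), tonemap(gt_hdr), data_range=1.0)
```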

Figure 6: Quantitative evaluation of sun position estimation. (a) Cumulative sun angular error comparison between baseline methods and ours. The estimation errors of sun azimuth (b) and elevation (c) angles are displayed as “violin plots,” where the envelope of each bin represents the percentile distribution, the gray line spans the 25th to 75th percentiles, and the median is shown as a white point.

I-Net is trained by summing up the direct supervision loss, the diffuse convolution loss, and the inverse rendering reconstruction loss, and P-Net is then trained by summing up the direct supervision loss and the tonemapped SSIM loss.

Figure 7: Relighting results with global lighting (shown as insets) on our SOLID-Img dataset.

5 Experiments

We perform a detailed network analysis and present qualitative and quantitative results on our SOLID-Img test set. We also capture a small set of real LDR outdoor local environment maps to analyze the generalization of our method. Finally, we show relit bunny results to validate our method qualitatively (more results are in the supplementary material). To measure the accuracy of the predicted global sky environment maps and local illumination maps, we use the mean absolute error (MAE) on the HDR sky environment map, the angular error on the sun position and sun azimuth/elevation angles, and SSIM on the detailed local lighting as error metrics.

5.1 Analysis using Synthetic Dataset

Effectiveness of I-Net. To validate the design of intrinsic decomposition, we compare our global lighting estimation branch with three baseline models in terms of the accuracy of estimated sun positions: 1) a regression-based model that directly regresses the global sky from the input image; 2) a two-stream convolutional network used to regress the sun azimuth angle and a normalized HDR panorama from an LDR panorama [ldr2hdr], whose input we modify to a single limited-FOV image to adapt it to our task; and 3) a model that learns to estimate both the sun azimuth angle and a non-parametric sky [deepsky]. In particular, the second baseline learns azimuth estimation as a regression task, while the third treats it as a classification problem. All baseline models are retrained on the SOLID-Img training dataset with the same settings (detailed model structures are in the supplementary material). Since our global lighting is represented by a non-parametric sky environment map, we compute the sun position by finding the largest connected component of the sky above a threshold (98%) and computing its centroid. We then rotate the estimated sky environment maps around their azimuth angles so that the sun is in the center of the image, which allows comparison with the baseline models.
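A short sketch of this sun-localization step (interpreting the 98% threshold as an intensity quantile, which is an assumption):

```python
import numpy as np
from scipy import ndimage

def sun_position_from_sky(sky_hdr, quantile=0.98):
    """Locate the sun in a predicted HDR sky panorama: threshold at the given
    intensity quantile, take the largest connected component, and return its
    centroid in (row, col) panorama coordinates."""
    intensity = sky_hdr.mean(axis=-1)                  # (H, W) luminance proxy
    mask = intensity >= np.quantile(intensity, quantile)
    labels, num = ndimage.label(mask)
    if num == 0:
        return None
    sizes = ndimage.sum(mask, labels, index=np.arange(1, num + 1))
    largest = int(np.argmax(sizes)) + 1
    return ndimage.center_of_mass(labels == largest)   # (row, col) centroid
```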

From Figure 6, we can see that our method shows significant improvement over the first two baselines and a comparable improvement over the third, thanks to the intrinsic cues. Qualitative results on the test dataset are shown in Figure 7 (numerical results and MAE errors on the estimated sky environment maps are provided in the supplementary material). Our relighting results and estimated lightings show a closer appearance to the ground truth (shown as insets) than those of other methods.

To help understand how the SH coefficients decode the global lighting information, we perform Grad-CAM [gradcam] on our global lighting encoder. We use the maximum response value of the SH coefficients as the target backward label to find which regions of the input are important for global lighting prediction. From Figure 8, the feature heatmaps validate that I-Net mostly captures directly illuminated information to estimate global lighting.

Figure 8: Visualization of Grad-CAM [gradcam] on our SH lighting prediction using SOLID-Img test set.
Table 1: Ablation study of our multi-input module (inputs vs. SSIM and MAE). The full input combination achieves the best performance of 0.803 SSIM and 0.523 MAE.

Effectiveness of P-Net. We train our P-Net with combinations of different inputs: the warped incomplete LDR image panorama, the relit Lambertian surface, the estimated sky environment map, and the incomplete shadow panorama. During training, we only apply direct supervision on the local lighting. We evaluate the SSIM and MAE errors between the estimated local lighting and the ground truth. From Table 1, we can tell that directly providing the estimated sky environment map rather than the relit Lambertian surface improves our algorithm marginally, while additionally providing the warped shadow panorama improves it a bit more. We conjecture this is because shadows provide occlusion information that is helpful for lighting estimation. In Figure 9, we show results without global lighting, with the relit Lambertian surface as global lighting, and with the estimated sky environment map as global lighting, respectively. We find that P-Net is incapable of learning the correct sun position from the warped color image alone but can recover it accurately once the estimated sky environment map is added, as shown in the first and third columns. Although the sun position is also well recovered with the relit Lambertian surface, the sun intensity still has a large gap from the real condition.

For an off-the-shelf renderer (e.g., Blender), we can achieve multi-object rendering by rendering only the object at each selected lighting position and then blending the result with the renderings from other positions through the alpha channel. In Figure 10, we show the visual quality of synthetic object insertion to better illustrate the usefulness of spatially-varying outdoor lighting estimation. As can be observed, our method renders correct lighting effects (specular highlights and shadows) on bunnies of different materials.
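A minimal sketch of this alpha-channel blending, assuming straight-alpha RGBA renderings per object (the renderer-side setup is omitted):

```python
import numpy as np

def composite_over(background, foregrounds):
    """Blend per-object renderings (each lit by its own local lighting) over
    a background using straight-alpha 'over' compositing.

    `background` is (H, W, 3); each foreground is an (H, W, 4) RGBA image
    rendered with only the object at its own lighting location."""
    out = background.astype(np.float64).copy()
    for fg in foregrounds:
        rgb, alpha = fg[..., :3], fg[..., 3:4]
        out = rgb * alpha + out * (1.0 - alpha)
    return out
```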

Figure 9: Estimated local lighting with different inputs.

Effects of different losses. To verify the necessity of each loss function, we evaluate the performance of I-Net and P-Net using different combinations of loss functions. In Figure 6, we observe that our model remains competitive even without the diffuse convolution loss, thanks to the constraint from intrinsics provided by I-Net. By further adding the diffuse convolution loss, I-Net can learn the global sky environment map more effectively with the guidance of the SH-encoded lighting and thus produce a more accurate sky estimation. If the tonemapped SSIM loss is ablated, performance on the test dataset degrades; with this loss added, the numbers reach 0.798 / 0.552 (SSIM / MAE), which shows that P-Net can predict local lighting more accurately, especially in terms of structural similarity.

Figure 10: Synthetic examples of inserting virtual objects of different materials, compared with NeurIllum [indoorlocallighting_pano] and LENet [deepsky].
Figure 11: Comparison of estimated global and local lighting on our real test dataset. Column 1 shows the input image and selected pixel locations. Column 2 shows the estimated global lighting of the DeepSky model [deepsky] and our method (the blue star marks the estimated sun position obtained by computing the centroid of the largest connected component, while the green star marks the ground truth sun position that we manually labeled from a low-exposure environment map). Column 3 shows the LDR local lighting environment map. Columns 4–7 show the local lighting estimated by NeurIllum [indoorlocallighting_pano] and our method in both LDR and HDR formats.
Figure 12: Qualitative comparison of relighting results using our real dataset.

5.2 Evaluation on Real Dataset

Real data capture. To validate that SOLID-Net is able to perform outdoor local lighting estimation, we capture real outdoor city street-view scenes and the corresponding spatially-varying local environment maps (see Figure 11). The images are captured by a Ricoh Theta SC2 camera with dual fisheye lenses. For the local lighting environment maps, the scenes are captured at a 1/2500 s shutter speed with an f/2.0 aperture by placing the panoramic camera as a light probe at different locations. Due to the limited dynamic range of our panoramic camera, the local environment maps are not able to faithfully record the intensity of sunlight. To obtain an accurate sun position for evaluation, we additionally capture a low-exposure panorama with a 1/25000 s shutter speed and label the sun position manually. The captured LDR local lighting is aligned to its view vector with respect to the camera facing direction. In total, our real test dataset includes 29 outdoor scenes and 67 LDR local lighting environment maps for evaluating our method quantitatively.

Comparison with previous work. We first compare the accuracy of global lighting estimation with the [deepsky] model using sun position errors; our method maintains a noticeably higher accuracy in both azimuth and elevation angular errors. From Column 2 of Figure 11, we can see that our method generates a clearer environment map under different sky conditions, and our estimated sun positions are closer to the ground truth. To evaluate the estimated local lighting, we compare our method with NeurIllum [indoorlocallighting_pano], retrained on our synthetic dataset, on estimated spatially-varying lighting both quantitatively and qualitatively. Overall, our method achieves a better SSIM / MAE performance (higher is better / lower is better) of 0.235 / 0.203 than NeurIllum. Comparing the estimations of our method (Columns 4–5) and NeurIllum (Columns 6–7) with the ground truth (Column 3) in Figure 11, we note that their method does not capture the sun position and intensity accurately, due to the large amount of missing panoramic information, which our method handles well. We also show relit bunny results to further compare the estimated spatially-varying lighting effects of our method and NeurIllum (see Figure 12). These show that our approach adapts to strongly spatially-varying local lighting effects in real scenes.

Figure 13: Real examples of virtual object insertion.
Figure 14: Real examples of intrinsic decomposition.

6 Discussion

We present the first end-to-end outdoor spatially-varying lighting estimation framework and demonstrate that it significantly outperforms previous work through extensive evaluations on both synthetic and real datasets. Our method is able to generalize to real scenes whose appearance differs slightly from our synthetic scenes. An example is shown in Figure 13, in which the virtual object is reasonably relit in a scene with structures rarely seen in the synthetic training data (a railway and glass).

Limitations and future work. Due to the material diversity gap between synthetic and real data, the intrinsic decomposition results on real data may not be as accurate as those on synthetic data (Figure 14, compared with Figure 4). Although SOLID-Net estimates an HDR lighting environment map to support realistic relighting effects, our lighting model is not yet suitable for generating animations that are sensitive to harsh lighting boundaries, which will be an interesting direction for future work.

References