Estimating outdoor lighting from a single image is one of the fundamental problems in computer vision. By recovering outdoor scene properties from a physical perspective, it has a huge impact on many applications, e.g., face/body relighting, scene understanding, and augmented reality (AR). This task is rather challenging since images are formed by conflating lighting with complex surface reflectance distributions and object geometry. In the outdoor scenario, existing solutions usually employ low-dimensional parametric models such as the Hošek-Wilkie (HW) sky model [outdoorlighting_param]
with four parameters to fit the sky illumination. However, the capacity of parametric models is not sufficient to represent complex real-world illumination. A recent non-parametric approach, which uses an autoencoder to learn a sky illumination model from a large-scale sky panorama dataset and encodes the lighting information from a single limited field-of-view (FOV) image, shows more promising results [deepsky].
However, as far as we know, all existing outdoor lighting estimation methods [outdoorlighting_param, deepsky, allweather] only consider outdoor illumination as a single global map without any spatially-varying consideration, i.e., the light probe is surrounded by an environment map that casts rays from infinitely far away. Spatially-varying lighting estimation has proved successful in indoor scenarios, where local indoor lighting is modeled using low-frequency parametric lighting represented by spherical harmonics (SH) [indoorlocallighting_SH, Barron2013] or a panoramic environment map [indoorlocallighting_pano].
Extending spatially-varying lighting estimation from indoor to outdoor is non-trivial in three aspects: 1) The extremely high-dynamic-range (HDR) sunlight and the complicated sky light under different weather conditions make outdoor lighting more difficult to parameterize than indoor lighting [indoorlocallighting_SH, Barron2013], while the existing non-parametric sky model [deepsky] treats it as a pure deep learning task without considering physics-based image formation constraints. 2) Non-parametric spatially-varying local lighting estimation is highly ill-posed, since different 3D locations should have different lighting observations and the majority of each local observation is missing [indoorlocallighting_pano]. 3) HDR and panoramic images capturing local lighting and geometry information outdoors are not yet available, although many datasets exist for this purpose in the indoor scenario, synthetically generated from SUNCG [suncg] and Matterport3D [Matterport3D].
In this paper, we propose SOLID-Net, a neural network for Spatially-varying Outdoor Lighting estimation using cues from Intrinsic image Decomposition, as shown in Figure 1. We tackle the three major challenges mentioned above with a two-stage framework: 1) We train a single-input-multi-output CNN to decompose an input image into intrinsic components: albedo (material-related), normal and plane distance (geometry-related), and shadow (lighting-related). These intrinsics provide a physically-based shading constraint by fitting SH-represented global lighting with low-frequency information, which is then combined with sky features extracted from the input image to generate a non-parametric sky model like [deepsky]. 2) With the geometry estimated from the decomposed intrinsics, we warp the limited-FOV input image and estimated shadow map to a spherical projection centered at the target location, which provides panoramic observations that reduce the ill-posedness. This is then combined with the global sky lighting from the previous step to train a multi-input-single-output CNN that complements high-frequency local lighting estimation. 3) We use the Blender SceneCity [scenecity] to create city models containing a large set of outdoor scenes, and render a synthetic outdoor lighting estimation dataset with labeled location information and corresponding lighting effects using a physically-based path tracer to facilitate the training of our network. SOLID-Net demonstrates significant improvements over other methods by making the following contributions:
integrating shading constraint from intrinsic decomposition into the global sky lighting estimation;
producing high-frequency local lighting estimation via panoramic warping and shadow map reference; and
building the first spatially-varying outdoor lighting estimation dataset with ground truth labels.
2 Related Work
Outdoor lighting estimation. Stumpfel et al. [paul04] proposed to explicitly capture HDR outdoor lighting environments, including the sun and sky, with multiple exposures. Lalonde et al. [lalonde12] first proposed lighting estimation from a single, generic outdoor scene; their approach relied on multiple cues (such as shadows, shading, and sky appearance variation) extracted from the image. Some solutions use parametric models to represent outdoor lighting: Cheng et al. [cheng2018] estimated lighting from the front and back cameras of a mobile phone, but represented lighting using low-frequency SH, which does not appropriately model outdoor lighting. Hold-Geoffroy et al. [outdoorlighting_param] learned to estimate Hošek-Wilkie (HW) sky model parameters from a single image, which was further extended by Zhang et al. [allweather] with the more flexible parametric Lalonde-Matthews (LM) sky model. To encode richer sky information, Hold-Geoffroy et al. [deepsky] designed an autoencoder to learn a non-parametric sky model from a large sky panorama dataset [skydatabase] and trained a network to learn sky lighting from limited-FOV images. LeGendre et al. [legendre19] used a mobile phone camera with three different reflective spheres to capture lighting ground truth and used these data to train their deep model effectively, but these spheres are still global lighting probes.
Local lighting estimation. A direct way of obtaining the local lighting of an environment is to capture the lighting intensity at a target location using a probe of known shape. Debevec et al. [paul98] showed that HDR environment maps can be captured with a reflective metallic sphere captured within the scene. Barron and Malik [Barron2013] decomposed the scene into intrinsic components including spatially-varying SH-based lighting, but their method required an RGBD image as input and relied on hand-crafted priors. To learn local lighting representations, Garon et al. [indoorlocallighting_SH] predicted fifth-order SH coefficients from an input image and local patches using synthetic data. More recently, Li et al. [indoorlighting_inverserender] proposed a dense spherical Gaussian lighting representation with differentiable rendering to conduct scene editing. However, all the methods mentioned above only considered indoor parametric lighting and are difficult to extend to outdoor lighting. Song et al. [indoorlocallighting_pano] proposed a cascaded model (denoted NeurIllum for brevity) to recover high-frequency local lighting from a color image warped according to recovered geometry, which showed promising texture details, but the lighting positions are sometimes less accurate due to the massive missing information in the panorama.
A large dataset containing HDR images and their corresponding illumination measured at different locations in a scene is required to learn to estimate outdoor intrinsics and local lighting. Existing outdoor panorama datasets, such as [deepsky, ldr2hdr], only provide a single global illumination map assuming distant lighting, which cannot be used to learn local lighting estimation. To provide training data for solving the “SOLID” problem, we introduce SOLID-Img, a dataset for Spatially-varying Outdoor Lighting estimation with ground truth Intrinsic Decomposition labels and a large number of rendered Images, as shown in Figure 2.
3.1 Data Generation
We adopt 3D city models from the Blender SceneCity [scenecity] to create synthetic scenes. In Blender SceneCity, there are 450 unique objects in 80 material categories. The object models provide surface materials, including diffuse albedo, roughness, and transparency, which are used to obtain photo-realistic renderings.
Camera setting. For each road block, we select a set of cameras with diverse views seeing most objects in the context, to provide comprehensive information for lighting estimation, as shown in Figure 2(a). Our process starts by selecting the “best” camera [physicallyrender] for each of the six horizontal view-direction sectors in every road block. For each sector, we select the view with the highest percentage of pixel coverage according to the item buffer, as long as it contains more than three object categories (more details are in the supplementary material).
Scene rendering. We collect 70 HDR environment maps from HDRI Haven [hdrihaven], covering different solar zenith angles from sunrise to sunset. To simulate different sunlight directions, we rotate each HDR environment map along the latitude direction by a uniformly sampled random angle. We then render images with the camera settings above and these HDR environment maps using the physically-based Blender Cycles rendering engine [blender], generating photo-realistic renderings with a path tracer at 512 samples per pixel. We record the material buffers (diffuse albedo buffer, normal buffer, depth buffer) as intermediate ground truth. We represent 3D geometry using the surface normal and plane distance, and render both as suggested in [im2pano3d]. To render shadows, we set the whole scene to a single Lambertian material and render it twice, with shadows turned on and off respectively; shadow maps are calculated from the difference between the two renderings, as shown in Figure 2(b).
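The two-pass shadow computation described above can be sketched as follows. This is a hypothetical helper, not the paper's implementation: the ratio-based definition of the shadow map and the array shapes are our assumptions.

```python
import numpy as np

def shadow_map(render_with_shadow, render_no_shadow, eps=1e-6):
    """Per-pixel shadow fraction in [0, 1]; 1 means fully shadowed.

    Both inputs are (H, W, 3) renders of the same Lambertian scene,
    once with shadows enabled and once with shadows disabled.
    """
    lit = np.asarray(render_no_shadow, dtype=np.float64)
    shd = np.asarray(render_with_shadow, dtype=np.float64)
    ratio = np.clip(shd / (lit + eps), 0.0, 1.0)  # fraction of light received
    return 1.0 - ratio.mean(axis=-1)              # average over RGB channels
```

A fully lit pixel yields a shadow value near 0, while a pixel that loses half of its direct light yields roughly 0.5.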
Local lighting collections. To obtain the ground truth of global lighting, we save the rotated environment maps. To collect local lighting, we randomly sample 4 locations in the scene to render 4 local light probes: the image is split into 4 quadrants, and a random 2D coordinate is sampled uniformly in each quadrant (excluding the sky part and the 5% of pixels near the image boundary). The 3D center of each local camera is calculated by casting a ray from the camera recording the scene to the scene surface and taking the first intersection point. From that point, we move the local camera center a small offset (in cm) away along the plane surface normal to prevent large invalid regions, and render a local light probe at this position. All local light probes are rendered in the equirectangular representation, as shown in Figure 2(c).
3.2 Data Filtering
Inspired by [physicallyrender], we remove low-quality renderings whose color distributions differ from those of natural images, e.g., with overly low or high intensities. To obtain a prior color distribution of real images, we compute normalized color histograms for 1100 selected real images from the Google Street View Dataset [googlestreetview]. For each rendered image, we calculate the histogram similarity to a real image as the sum over bins of the minimum value (histogram intersection), assign the image a score equal to the largest histogram similarity over all real images, and finally select all images whose color similarity score exceeds a threshold, as shown in Figure 2(d). This process selects 38000 images from the initially rendered image set, composing the SOLID-Img dataset. The dataset is then carefully split into train/test sets according to different lighting conditions.
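The filtering score can be illustrated with the sketch below. It is a simplified, hypothetical version that uses a single intensity histogram per image; the actual color-histogram binning used in the paper is not specified.

```python
import numpy as np

def color_similarity(img, ref_hists, bins=32):
    """Largest histogram-intersection score of `img` against a list of
    normalized reference histograms (one per real image)."""
    h, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    h = h / max(h.sum(), 1)                       # normalize to a distribution
    # histogram intersection: sum over bins of the minimum value
    return max(np.minimum(h, r).sum() for r in ref_hists)
```

Images whose best score against the real-image set falls below a chosen threshold would then be discarded.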
This section introduces the design methodology of SOLID-Net, whose pipeline is shown in Figure 3. It is a two-stage framework that learns to reconstruct local HDR outdoor environment maps, trained with the SOLID-Img dataset introduced in Section 3.
4.1 Problem Formulation
We formulate illumination estimation as a regression problem. Given an LDR image with limited FOV and a selected pixel location in homogeneous coordinates, our model outputs an HDR illumination map centered around the 3D location of that pixel as well as a global sky environment HDR illumination map, both represented as panoramic images with full FOV.
4.2 Network Architecture
A straightforward approach to estimating outdoor lighting from the scene would be to simply take the single limited-FOV image as input, encode it into a feature map using a CNN, and feed the feature map into a lighting-regression sub-network [lalonde17, deepsky]. Unsurprisingly, we find that this results in outdoor lighting estimation with higher error (see Figure 6), presumably because it is difficult for the network to learn how to extract full-FOV lighting from a limited-FOV image. One way to improve it is to bring in regularization from the Lambertian rendering equation [inverserendernet], which is, however, challenging for outdoor spatially-varying lighting estimation because: 1) outdoor scenes have large areas of shadow occlusion which cannot be directly fitted by the Lambertian model; 2) SH lighting has a limited dynamic range and is too smooth to represent sharp sky lighting and detailed texture. Therefore, we propose a two-stage framework to jointly solve these problems by: 1) proposing an intrinsic image decomposition network (denoted as I-Net) that takes a limited-FOV image as input and estimates its intrinsic components as well as a global sky environment map, and 2) designing a panoramic completion module (denoted as P-Net) that estimates local lighting from the outputs of the previous stage and the input location.
I-Net. As shown by the blue blocks in Figure 3, I-Net takes a single limited-FOV LDR image as input and produces multiple outputs, including diffuse albedo, surface normal, plane distance map [im2pano3d], shadow map, second-order SH coefficients, and the sky environment map generated from the SH coefficients and sky features. We use a single encoder to capture global features of intrinsic information, five decoders for the intrinsic components, and a decoder lighting branch for sky environment map regression. Skip links are used to preserve details. In particular, for sky map regression, we use a fully-connected (FC) layer to process the output feature maps of the lighting-branch encoder into a latent vector of size 27 (second-order SH in RGB). For the decoder, we reshape this vector, upsample it 8 times, and combine it with flip-padded sky features to generate a 256×128 HDR sky environment map. The lighting information encoded in the SH coefficients can be considered a low-frequency form of the sky environment map, and it guides the recovery of high-frequency sky environment maps using sky features extracted from the input image. In summary, I-Net predicts intrinsic components (examples are provided in Figure 4) and the sky environment map:
P-Net. The 3D location of each pixel is calculated from the surface normal and plane distance predicted by I-Net. With a fixed camera intrinsic matrix and the 2D pixel locations of the whole image, we reproject each pixel into the 3D scene by back-projecting it along its viewing ray until it reaches the plane defined by the predicted normal and plane distance.
Using the reconstructed 3D scene, we warp the input image and estimated shadow map and spatially align them with the output local lighting to provide panoramic observations. First, we compute the local camera location from the input point position and apply the same translation along the normal direction of the supporting plane as used for the training data (defined in Section 3). Second, we perform panoramic warping through a forward projection using the estimated geometry and camera location to map pixels of the input image and shadow map into panoramic images (an example is provided in Figure 5). A Z-buffer is computed to discard invisible points, and pixels without projected points are set to 0.
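The forward projection with a Z-buffer might look like the following sketch: 3D points are projected into an equirectangular image centered at the probe location, the nearest point wins per pixel, and unobserved pixels stay 0. The axis conventions and default resolution here are illustrative assumptions, not the paper's.

```python
import numpy as np

def warp_to_panorama(points, colors, center, height=64, width=128):
    """Forward-project (N, 3) points with (N, 3) colors into an
    equirectangular image centered at `center`."""
    d = points - center                            # rays from the probe center
    r = np.linalg.norm(d, axis=-1)
    theta = np.arctan2(d[:, 0], d[:, 2])           # azimuth in [-pi, pi]
    phi = np.arcsin(np.clip(d[:, 1] / np.maximum(r, 1e-9), -1.0, 1.0))
    u = ((theta + np.pi) / (2 * np.pi) * width).astype(int) % width
    v = ((phi + np.pi / 2) / np.pi * height).astype(int).clip(0, height - 1)
    pano = np.zeros((height, width, 3))
    zbuf = np.full((height, width), np.inf)
    for ui, vi, ri, ci in zip(u, v, r, colors):
        if ri < zbuf[vi, ui]:                      # keep only the closest point
            zbuf[vi, ui] = ri
            pano[vi, ui] = ci
    return pano
```

When two points fall on the same panorama pixel, the Z-buffer keeps the closer one, mimicking the occlusion handling described above.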
Since the local lightings share the same camera rotation, the sky parts in local lighting should be consistent; this motivates us to take the sky as an input to P-Net. As shown by the orange blocks in Figure 3, P-Net concatenates the two incomplete panoramic images and the global lighting estimated by I-Net as inputs, and outputs a dense pixel-wise prediction of the local lighting panorama with full FOV and high-frequency details.
P-Net is implemented as a fully convolutional U-Net [unet].
4.3 Loss Functions
Direct supervision loss. Direct supervision for I-Net is provided to 1) diffuse albedo predictions, 2) shadow predictions, 3) surface normal predictions via a cosine loss, 4) plane distance map predictions, and 5) sky environment map predictions. Direct supervision for P-Net is then provided to local lighting predictions.
where the hatted symbols denote the estimations of I-Net and the dot product is taken for each vector in a matrix.
Diffuse convolution loss. To guide the sky environment map estimated by I-Net to extract low-frequency lighting information from the encoded SH coefficients, we add a diffuse convolution loss that forces the diffuse-convolved sky environment map to have an appearance close to a pure Lambertian surface relit by the SH lighting:
where the terms denote the global diffuse albedo, the SH coefficients obtained by reshaping the latent vector, the normal map of a sphere in panoramic coordinates, and the second-order SH basis; the diffuse convolution function is defined as
where the integration domain is the hemisphere centered at each pixel of the global lighting environment map, weighted by the cosine between the normal vector at that pixel and each unit lighting direction, and normalized by the sum of solid angles over the hemisphere. The solid angle of each pixel in the panorama map carries a latitude-dependent scale factor, because pixels in the panorama map at different latitudes correspond to projections with different area sizes on the unit sphere.
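A minimal sketch of the diffuse convolution for a single normal direction, assuming an equirectangular map with latitude-dependent per-pixel solid angles (the discretization details are our assumptions):

```python
import numpy as np

def diffuse_convolve(env, normal):
    """Diffuse-convolve an equirectangular HDR map `env` (H, W, 3) for one
    unit `normal`; pixels are weighted by cosine and per-pixel solid angle."""
    H, W, _ = env.shape
    pol = (np.arange(H) + 0.5) / H * np.pi              # polar angle per row
    azi = (np.arange(W) + 0.5) / W * 2 * np.pi - np.pi  # azimuth per column
    phi, theta = np.meshgrid(pol, azi, indexing="ij")
    dirs = np.stack([np.sin(phi) * np.sin(theta),       # unit direction per pixel
                     np.cos(phi),
                     np.sin(phi) * np.cos(theta)], axis=-1)
    # solid angle shrinks towards the poles (sin of the polar angle)
    domega = np.sin(phi) * (np.pi / H) * (2 * np.pi / W)
    cosine = np.clip(dirs @ normal, 0.0, None)          # clamp = hemisphere
    w = cosine * domega
    return (env * w[..., None]).sum(axis=(0, 1)) / w.sum()
```

Because the result is a cosine-and-solid-angle weighted average, a constant environment map convolves to its own value for any normal, which is a handy sanity check.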
Inverse rendering reconstruction loss. To make the network learn constraints from the physically-based image formation model, we treat the SH coefficients as an intermediate variable and provide indirect supervision via an inverse rendering reconstruction loss on the directly illuminated part, multiplying by a non-shadowed mask to disregard the effect of shadows:
where the element-wise product is taken with a non-shadowed mask computed from the shadow maps among the intrinsics; a binary Otsu segmentation on the histogram of the shadow maps is further used to eliminate weak interreflections. The reconstruction compares the RGB image against the estimated diffuse albedo multiplied by the shading obtained by applying the SH basis to the normal map, with a fixed gamma to compress the dynamic range.
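A simplified sketch of this masked reconstruction loss; for brevity it uses a fixed shadow threshold in place of the Otsu segmentation, and all names and default values are hypothetical:

```python
import numpy as np

def recon_loss(img, albedo, shading, shadow, gamma=2.2, thresh=0.5):
    """L1 distance between gamma-compressed input and albedo * shading,
    evaluated only on non-shadowed pixels.

    img, albedo, shading: (H, W, 3); shadow: (H, W) in [0, 1].
    """
    mask = (shadow < thresh).astype(np.float64)[..., None]  # non-shadowed mask
    pred = np.clip(albedo * shading, 0.0, None) ** (1.0 / gamma)
    target = np.clip(img, 0.0, None) ** (1.0 / gamma)
    return np.abs(mask * (pred - target)).sum() / max(mask.sum(), 1.0)
```

If the image is exactly the product of albedo and shading on the unshadowed region, the loss vanishes, which is the constraint the paper exploits.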
Tonemapped SSIM loss. A structural similarity index measure (SSIM) loss between dynamic-range-compressed images, with fixed gamma and exposure parameters in our experiments, is used to recover structural similarity between the estimation and the ground truth.
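The tonemapped SSIM loss can be illustrated with a single-window SSIM computed over the whole image, a simplification of the usual sliding-window SSIM; the exposure and gamma defaults below are placeholders, not the paper's values:

```python
import numpy as np

def tonemap(x, exposure=1.0, gamma=2.2):
    """Compress an HDR map into [0, 1] with exposure scaling and gamma."""
    return np.clip(exposure * x, 0.0, 1.0) ** (1.0 / gamma)

def ssim_global(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-window SSIM computed over the whole image."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (va + vb + c2))

def tonemapped_ssim_loss(pred_hdr, gt_hdr):
    """1 - SSIM between tonemapped prediction and ground truth."""
    return 1.0 - ssim_global(tonemap(pred_hdr), tonemap(gt_hdr))
```

A production implementation would use a windowed SSIM (e.g., the standard 11×11 Gaussian-window formulation) rather than this global statistic.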
I-Net is trained by summing the direct supervision loss, diffuse convolution loss, and inverse rendering reconstruction loss; P-Net is then trained by summing the direct supervision loss and tonemapped SSIM loss.
We perform detailed network analysis and present qualitative and quantitative results on our SOLID-Img test set. We also capture a small set of real LDR outdoor local environment maps to analyze the generalization of our method. Finally, we show relit bunny results to validate our method qualitatively (more results are in the supplementary material). To measure the accuracy of the predicted global sky environment maps and local illumination maps, we use the mean absolute error (MAE) on the HDR sky environment map, angular errors on the sun position and sun azimuth/elevation angles, and SSIM on the detailed local lighting as error metrics.
5.1 Analysis using Synthetic Dataset
Effectiveness of I-Net. To validate the design of intrinsic decomposition, we compare our global lighting estimation branch with three baseline models in terms of the accuracy of estimated sun positions: 1) a regression-based model that directly regresses the global sky from the input image; 2) a two-stream convolutional network used to regress the sun azimuth angle and a normalized HDR panorama from an LDR panorama [ldr2hdr], whose input we modify to a single limited-FOV image to fit our task; and 3) a model that learns to estimate both the sun azimuth angle and a non-parametric sky [deepsky]. In particular, the second baseline learns azimuth estimation as a regression task, while the third treats it as a classification problem. All baseline models are retrained on the SOLID-Img training dataset with the same settings (detailed model structures are in the supplementary material). Since our global lighting is represented by a non-parametric sky environment map, we compute the sun position by finding the largest connected component of the sky above a threshold (98%) and computing its centroid. We then rotate the estimated sky environment maps around their azimuth angles so that the sun is at the center of the image, enabling comparison with the baseline models.
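The sun localization step can be sketched as follows, reading the 98% threshold as a fraction of the maximum sky intensity (one plausible interpretation) and finding the largest 4-connected bright component with azimuthal wrap-around:

```python
import numpy as np
from collections import deque

def sun_position(sky, frac=0.98):
    """Centroid (row, col) of the largest connected bright component in an
    equirectangular sky map; pixels above frac * max intensity count as sun."""
    inten = sky.mean(axis=-1) if sky.ndim == 3 else np.asarray(sky, float)
    mask = inten >= frac * inten.max()
    H, W = mask.shape
    seen = np.zeros_like(mask)
    best = []
    for r0, c0 in zip(*np.nonzero(mask)):
        if seen[r0, c0]:
            continue
        comp, queue = [], deque([(r0, c0)])
        seen[r0, c0] = True
        while queue:                               # BFS over 4-neighbours
            r, c = queue.popleft()
            comp.append((r, c))
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, (c + dc) % W      # wrap around in azimuth
                if 0 <= rr < H and mask[rr, cc] and not seen[rr, cc]:
                    seen[rr, cc] = True
                    queue.append((rr, cc))
        if len(comp) > len(best):
            best = comp
    return tuple(np.mean(best, axis=0))
```

The pixel centroid can then be converted to azimuth/elevation angles using the equirectangular mapping.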
From Figure 6, we can see that our method shows significant improvement over the first two baselines and performs comparably to the third, thanks to the intrinsic cues. Qualitative results on the test dataset are shown in Figure 7 (numerical results and MAE errors on the estimated sky environment maps are provided in the supplementary material). Our relighting results and estimated lightings show a closer appearance to the ground truth (shown as insets) than other methods.
To understand how the SH coefficients encode the global lighting information, we apply Grad-CAM [gradcam] to our global lighting encoder. We use the maximum response value of the SH coefficients as the target backward label to find which regions of the input are important for global lighting prediction. The feature heatmaps in Figure 8 validate that I-Net mostly captures directly illuminated regions to estimate global lighting.
Effectiveness of P-Net. We train our P-Net with combinations of different inputs: the warped incomplete LDR image panorama, the relit Lambertian surface, the estimated sky environment map, and the incomplete shadow panorama image. During training, we apply only direct supervision on local lighting. We evaluate the SSIM and MAE errors between the estimated local lighting and the ground truth. From Table 1, we can tell that directly providing the estimated sky environment map rather than the relit Lambertian surface improves our algorithm marginally, while also providing the shadow panorama improves it a bit more; we conjecture this is because shadows provide occlusion information that is helpful for lighting estimation. In Figure 9, we show results without global lighting, with the relit Lambertian surface as global lighting, and with the estimated sky environment map as global lighting, respectively. We find that P-Net is incapable of learning the correct sun position from the warped color image alone but can recover it accurately once global lighting is added, as shown in the first and third columns. Although the sun position is well recovered with the relit Lambertian surface, the sun intensity still shows a large gap from the real condition.
For an off-the-shelf renderer (e.g., Blender), we can achieve multi-object rendering by setting it to render only the object at the selected lighting position, and then blending this result with rendering results from other positions through the alpha channel. In Figure 10, we show the visual quality of synthetic object insertion to better illustrate the usefulness of spatially-varying outdoor lighting estimation. As can be observed, our method renders correct lighting effects (specular highlights and shadows) on the bunnies under different materials.
Effects of different losses.
To verify the necessity of each loss function, we evaluate the performance of I-Net and P-Net using different combinations of loss functions. In Figure 6, we observe that our model achieves comparable improvement even without the inverse rendering reconstruction loss, due to the constraint from intrinsics provided by I-Net. By further adding the diffuse convolution loss, I-Net learns the global sky environment map more effectively under the guidance of the SH-encoded lighting and produces a more accurate sky estimation. If the tonemapped SSIM loss is ablated, the performance on the test dataset degrades; with this loss, the numbers are 0.798 / 0.552 (SSIM / MAE), which shows that P-Net predicts local lighting more accurately, especially in terms of structural similarity.
5.2 Evaluation on Real Dataset
Real data capture. To validate that SOLID-Net is able to perform outdoor local lighting estimation, we capture real outdoor city street-view scenes and the corresponding spatially-varying local environment maps (see Figure 11). The images are captured by a Ricoh Theta SC2 camera with dual fisheye lenses. For the local lighting environment maps, the scenes are captured at 1/2500 s shutter speed with f/2.0 aperture by placing the panoramic camera as a light probe at different locations. Due to the limited dynamic range of our panoramic camera, the local environment maps cannot faithfully record the intensity of sunlight. To obtain accurate sun positions for evaluation, we further capture a low-exposure panorama at 1/25000 s shutter speed and label the sun position manually. The captured LDR local lighting is aligned to its view vector with respect to the camera facing direction. In total, our real test dataset includes 29 outdoor scenes and 67 LDR local lighting environment maps for evaluating our method quantitatively.
Comparison with previous work. We first compare the accuracy of global lighting estimation with the model of [deepsky] using sun position errors; our method maintains a higher accuracy in azimuth/elevation angular errors. From Column 2 of Figure 11, we can see that our method generates a clearer environment map under different sky conditions, and our estimated sun positions are closer to the ground truth. To evaluate the estimated local lighting, we compare our method with NeurIllum [indoorlocallighting_pano], retrained on our synthetic dataset, on estimated spatially-varying lighting quantitatively and qualitatively. Overall, our method achieves a better SSIM / MAE performance (higher is better / lower is better) of 0.235 / 0.203 than NeurIllum. Comparing the estimations of our method (Columns 4-5) and NeurIllum (Columns 6-7) with the ground truth (Column 3) in Figure 11, we note that their method does not capture the accurate sun position and intensity due to missing panoramic information, which our method handles well. We also show relit bunny results to further compare the estimated spatially-varying lighting effects of our method and NeurIllum (see Figure 12). These show that our approach adapts to strongly spatially-varying local lighting effects in real scenes.
We present the first end-to-end outdoor spatially-varying lighting estimation framework and demonstrate that it significantly outperforms previous works via extensive evaluations on both synthetic and real datasets. Our method is able to generalize to real scenes whose appearance differs slightly from our synthetic scenes. An example is shown in Figure 13, in which the virtual object is reasonably relit in a scene containing structures rarely seen in the synthetic training data (such as a railway and glass).
Limitations and future work. Due to the material diversity gap between synthetic and real data, the intrinsic decomposition results on real data may not be as accurate as those on synthetic data (Figure 14, compared with Figure 4). Although SOLID-Net estimates an HDR lighting environment map to support realistic relighting effects, our lighting model is not suitable for generating animations that are sensitive to harsh lighting boundaries, which is an interesting direction for future work.