GMLight: Lighting Estimation via Geometric Distribution Approximation

02/20/2021 · Fangneng Zhan et al. · Nanyang Technological University

Lighting estimation from a single image is an essential yet challenging task in computer vision and computer graphics. Existing works estimate lighting by regressing representative illumination parameters or generating illumination maps directly. However, these methods often suffer from poor accuracy and generalization. This paper presents Geometric Mover's Light (GMLight), a lighting estimation framework that employs a regression network and a generative projector for effective illumination estimation. We parameterize illumination scenes in terms of the geometric light distribution, light intensity, ambient term, and auxiliary depth, and estimate them as a pure regression task. Inspired by the earth mover's distance, we design a novel geometric mover's loss to guide the accurate regression of light distribution parameters. With the estimated lighting parameters, the generative projector synthesizes panoramic illumination maps with realistic appearance and frequency. Extensive experiments show that GMLight achieves accurate illumination estimation and superior fidelity in relighting for 3D object insertion.


1 Introduction

Lighting estimation aims to recover panoramic illumination from a single image with a limited field of view. It has a wide range of applications in computer vision and computer graphics, such as object relighting in mixed reality. However, lighting estimation is a typical under-constrained problem, since a full-view illumination map must be inferred from an observation that covers only part of the scene. In addition, high-dynamic-range (HDR) illumination needs to be inferred from low-dynamic-range (LDR) observations for the purpose of realistic object relighting.

Figure 1: Illustration of our proposed Geometric Mover’s Light (GMLight). GMLight treats illumination maps as discrete distributions in geometric spaces, as defined by the scene depth. Inspired by the earth mover’s distance, we design a geometric mover’s loss (GML) that measures the distance between two geometric distributions by calculating the minimum cost of moving one distribution to the other. GML aims to find the best Moving Plan (with the minimum total cost), as illustrated by the connections between the two distributions. The thickness of the connecting lines denotes the amount of ‘Earth’ moved between the two points.

Lighting estimation has been investigated via direct generation of illumination maps gardner2017 ; song2019 ; srinivasan2019lighthouse or regression of representation parameters, such as spherical harmonic parameters cheng2018shlight ; garon2019fast and spherical Gaussian parameters gardner2019deeppara ; li2020rendering . However, representation-based methods often struggle to regress frequency information accurately (especially high-frequency information), which leads to inaccurate shading and shadow effects in object relighting garon2019fast ; li2019spherical . Direct generation methods, in contrast, can recover certain high-frequency information, but often suffer from poor generalization gardner2017 ; chen2019neural .

In our earlier work zhan2021emlight , we designed a network named EMLight that models lighting with spherical distributions and introduced a spherical mover’s loss to regress the spherical distribution parameters. However, EMLight employs a simplified spherical surface to represent the geometric space when modeling a lighting distribution, ignoring the depth that is present in real-world illumination scenes. As a result, it often suffers from low geometric accuracy in light estimation and cannot handle spatially-varying illumination. In this work, we propose Geometric Mover’s Light (GMLight) to model illumination with geometric distributions, which are closer to real-world illumination distributions than the simplified spherical distribution zhan2021emlight and thus provide higher accuracy. Specifically, we introduce auxiliary depths to capture scene geometry, which enables effective estimation of spatially-varying illumination. Furthermore, we design a progressive Gaussian guidance scheme for lighting generation that significantly outperforms the fixed Gaussian guidance in EMLight.

Specifically, GMLight represents lighting by discrete distributions in a geometric space (defined by scene depths, as illustrated in Fig. 1) and formulates lighting estimation as a distribution regression task. It consists of a regression network for lighting parameter prediction and a generative projector for illumination map synthesis. For lighting parameter estimation, we propose a distribution-based representation that decomposes an illumination scene into four components: depth values, light distribution, light intensity, and ambient term. The last two are scalars and can be directly regressed with a naive L2 loss. Light distributions and scene depths, in contrast, are spatially localized in scenes and are therefore not suited to a naive L2 loss, which does not capture geometric information. Inspired by the earth mover’s distance emd between distributions, we design a Geometric Mover’s Loss (GML) that regresses light distributions and scene depths with an ‘earth mover’ in a geometric space. GML searches for an optimal plan that moves one distribution to another with minimal moving cost, and thus effectively captures spatial information when regressing light distributions and scene depths.

With the estimated illumination scene parameters, the generative projector generates illumination maps with realistic frequency and appearance in an adversarial manner. In detail, a spherical Gaussian function gardner2019deeppara is adopted to map the estimated light parameters into a Gaussian map (a panoramic image), which serves as the guidance for illumination map generation. With the scene geometry reflected by the depth values, the Gaussian map can be constructed at any position in a scene, which enables spatially-varying illumination. To provide effective guidance at different generation stages, we adopt Gaussian functions of different radii to generate Gaussian maps in a coarse-to-fine manner. Since illumination maps are panoramic images that suffer from spherical distortions at different latitudes, we adopt spherical convolutions spherenet to generate panoramic illumination maps accurately. More details are provided in the ensuing sections.

The rest of this paper is organized as follows. Section 2 presents related works. The proposed method is then described in detail in Section 3. Experimental results are further presented and discussed in Section 4. Finally, concluding remarks are drawn in Section 5.

Figure 2: Illumination representation and regression. Given an Illumination Map, we first determine a Light Source and Ambient Region via thresholding and then assign light source pixels to anchor points as in Gaussian Map (visualized by spherical Gaussian function). The illumination map is thus represented by light distribution, light intensity, and ambient term. Scene depths (dataset provided) are also assigned to the anchor points as ground truth. The regression network at the bottom takes a local region of the illumination map (highlighted by the red box) as input and employs DenseNet-121 with four fully-connected (FC) layers to regress the light distribution, light intensity, ambient term and scene depths. The estimated parameters are then fed to the Generative Projector for illumination generation.
Figure 3: Visualization of 128 anchor points (pre-defined as in vogel1979 ) on a unit sphere and panorama.

2 Related Works

Lighting estimation is a classic challenge in computer vision and computer graphics, and it is critical for realistic relighting during object insertion lalonde2012 ; geoffroy2017 ; murmann2019dataset ; boss2020 ; ngo2019reflectance ; liao2019approximate ; maurer2018combining ; gupta2012combined and image synthesis zhan2019sfgan ; zhan2018verisimilar ; zhan2019esir ; zhan2019gadan ; zhan2019acgan ; zhan2019scene ; zhan2020towards ; zhan2020sagan ; xue2018accurate . Traditional approaches require user intervention or assumptions about the underlying illumination model, scene geometry, etc. For example, karsch2011 recovers parametric 3D lighting from a single image but requires user annotations for initial lighting and geometry estimates. zhang2016 requires a full multi-view 3D reconstruction of scenes. lombardi2016 estimates illumination from an object of known shape with a low-dimensional model. Maier et al. maier2017 make use of additional depth information to recover spherical harmonics (SH) illumination. Sengupta et al. sengupta2019 jointly estimate the albedo, normals, and lighting of an indoor scene from a single image. Barron et al. barron2015 estimate shape, lighting, and material but rely on data-driven priors to compensate for the lack of geometry information. Zoran et al. zoran2014 effectively estimate illumination from a single image by leveraging a generic viewpoint.

On the other hand, more recent works aim to estimate lighting from images by regressing representation parameters cheng2018shlight ; gardner2019deeppara ; li2020rendering or generating illumination maps gardner2017 ; song2019 . Cheng et al. cheng2018shlight regress the SH parameters of global lighting with a rendering loss. Gardner et al. gardner2017 generate illumination maps directly with a two-step training strategy. Gardner et al. gardner2019deeppara estimate the positions, intensities, and colors of light sources and reconstruct illumination maps with a spherical Gaussian function. Garon et al. garon2019fast estimate lighting by predicting SH coefficients from a background image and a local patch. Song et al. song2019 estimate per-pixel 3D geometry and use a convolutional network to predict unobserved content in the environment map. On top of this, Li et al. li2019spherical represent illumination maps with multiple spherical Gaussian functions and regress the corresponding Gaussian parameters for lighting estimation. Legendre et al. legendre2019deeplight regress HDR lighting from LDR images by comparing a sphere rendered with the predicted illumination to the ground truth. Srinivasan et al. srinivasan2019lighthouse estimate a 3D volumetric RGB model of a scene and use standard volume rendering to estimate incident illumination. Given any illumination map, the framework proposed by sun2019 is able to relight an RGB portrait image taken in an unconstrained environment. Besides, several works liu2020shadow ; zhan2020aicnet adopt generative adversarial networks to generate shadows without explicitly estimating the illumination map.

However, the aforementioned works either lose realistic frequency information or produce inaccurate light sources during illumination estimation. In contrast, we decompose illumination maps into four components and design a regression network with a geometric mover’s loss to estimate the decomposed illumination parameters with accurate spatial information. Using the estimated illumination parameters as guidance, our designed generative projector generates illumination maps accurately with realistic frequency information.

3 Proposed Method

GMLight consists of two sequential modules, a regression network and a generative projector, as illustrated in Figs. 2 and 5. We propose a distribution-based representation that parameterizes an illumination scene with four sets of parameters: light distribution, light intensity, ambient term, and depth, all of which are estimated by the regression network. The estimated illumination scene parameters then serve as guidance for the generative projector to synthesize realistic illumination maps.

3.1 Distribution-Based Illumination Representation

Fig. 2 illustrates our proposed distribution-based illumination representation. Given an image, we first separate the light sources from its HDR illumination map, since the light sources in a scene are the most critical for illumination prediction. Following gardner2019deeppara , we take the top 5% of pixels with the highest values in the HDR illumination map as the light sources. The light intensity and the ambient term are then determined by the sum of all light-source pixels and the average of the ambient-region pixels, respectively. As in vogel1979 , we generate N (N = 128 by default in this work) anchor points that are uniformly distributed on a unit sphere, as illustrated in Fig. 3. Each light-source pixel is assigned to its nearest anchor point based on the minimum radian distance, and the light-source value of an anchor point is determined by summing all its affiliated pixels. Afterwards, the anchor-point values are normalized by the light intensity so that the anchor points form a standard discrete distribution on the unit sphere (i.e., the light distribution). Since the Laval Indoor dataset gardner2017 provides pixel-level depth annotations, the depth at each anchor point can be determined by averaging the depth values of all the pixels assigned to it (likewise based on the minimum radian distance).
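This representation step can be sketched as follows; it is a minimal illustration that assumes a golden-angle (Vogel) spiral for the anchor points, and the function names (vogel_anchor_points, panorama_directions, illumination_parameters) are ours, not from the paper's code:

```python
import numpy as np

def vogel_anchor_points(n=128):
    """Generate n roughly uniform anchor directions on the unit sphere with
    a golden-angle (Vogel) spiral."""
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))
    k = np.arange(n)
    z = 1.0 - 2.0 * (k + 0.5) / n                      # uniform in [-1, 1]
    r = np.sqrt(1.0 - z ** 2)
    phi = golden_angle * k
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)   # (n, 3)

def panorama_directions(h, w):
    """Unit viewing direction of every pixel of an equirectangular panorama."""
    theta = (np.arange(h) + 0.5) / h * np.pi           # polar angle
    phi = (np.arange(w) + 0.5) / w * 2.0 * np.pi       # azimuth
    theta, phi = np.meshgrid(theta, phi, indexing="ij")
    return np.stack([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)], axis=-1)          # (h, w, 3)

def illumination_parameters(hdr, depth, anchors, light_ratio=0.05):
    """Decompose an HDR panorama into light distribution, light intensity,
    ambient term and per-anchor depth."""
    h, w, _ = hdr.shape
    energy = hdr.mean(axis=-1)                         # per-pixel radiance
    thresh = np.quantile(energy, 1.0 - light_ratio)    # top 5% -> light sources
    light_mask = energy >= thresh

    intensity = hdr[light_mask].sum(axis=0)            # total light-source energy (RGB)
    ambient = hdr[~light_mask].mean(axis=0)            # average ambient colour (RGB)

    dirs = panorama_directions(h, w).reshape(-1, 3)
    nearest = np.argmax(dirs @ anchors.T, axis=1)      # max cosine = min radian distance
    nearest = nearest.reshape(h, w)

    n = anchors.shape[0]
    distribution = np.zeros(n)
    anchor_depth = np.zeros(n)
    for i in range(n):
        in_cell = nearest == i
        distribution[i] = energy[light_mask & in_cell].sum()
        anchor_depth[i] = depth[in_cell].mean() if in_cell.any() else 0.0
    distribution /= max(distribution.sum(), 1e-8)      # standard discrete distribution
    return distribution, intensity, ambient, anchor_depth
```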

3.2 Illumination Parameter Prediction

The structure of the regression network is shown in Fig. 2. Four branches are adopted to regress the four sets of parameters, respectively. For the light intensity and ambient term, a naive L2 loss can be adopted for the regression. However, for the light distribution and depth values, which are localized on a sphere, the naive L2 loss can exploit neither the spatial information of the geometric distribution nor the property of the standard distribution (for the light distribution, the anchor-point values sum to one). We therefore take advantage of the spatial structure of the geometric distribution and propose a novel geometric mover’s loss to regress the light distribution and depth values.

Geometric Mover’s Loss: Although a naive L2 loss or cross-entropy loss could be applied to regress the discrete distribution (namely the values of the N anchor points), such losses introduce several problems. Firstly, the L2 loss regresses each anchor point separately and cannot effectively evaluate the discrepancy between two sets of anchor points (two distributions). Secondly, neither the L2 loss nor the cross-entropy loss is able to exploit the spatial information of the light distribution and depth values during parameter regression.

Inspired by the earth mover’s distance emd , which measures the discrepancy between two distributions, we propose a novel geometric mover’s loss (GML) to measure the discrepancy between two discrete geometric distributions. To derive GML, we consider two discrete distributions over the N anchor points on the sphere, denoted by u and v. Intuitively, GML is the minimum total cost required to transform u into v, where the cost of each move is the amount of ‘earth’ moved multiplied by the distance over which it is moved. We define a transportation plan (or moving plan) matrix P of size N x N, where each entry P_ij denotes the amount of ‘earth’ moved from point i to point j, and a cost matrix C of size N x N, where each entry C_ij gives the unit cost of moving from point i to point j.

Figure 4: Geometric distances in GMLight and spherical distances in EMLight. Spherical distances in EMLight assume that anchor points are distributed on a spherical surface which neglects the complex geometry of real scenes. Geometric distances in GMLight instead capture real geometries of anchor points by using scene depths, i.e. the distance from the anchor point to the spherical center. Since the spherical angle between anchor points is known vogel1979 , geometric distances between anchor points can be computed.
Figure 5: Structure of the generative projector. The Input Image is fed to an Encoder to produce a Feature Vector for the ensuing multi-stage spherical convolution. Multiple Gaussian Maps are acquired via spherical Gaussian mapping with different radius parameters and are injected into the multi-stage generation process to synthesize the Output illumination map. F denotes spatially-adaptive normalization park2019spade for feature injection.

In our previous work EMLight, the unit cost between point i and point j is measured by their radian distance along the unit sphere, as shown in Fig. 4. However, this cost computation simplifies the lighting distribution to be spherical and ignores the complex geometry of real scenes. In contrast, we model the lighting distribution with a geometric distance determined by the depth map. As shown in Fig. 4, the distance from a geometric anchor point to the spherical center is its depth value instead of the unit radius, thus reflecting the real geometry of the scene. The geometric distance (or unit cost) between anchor points i and j can be computed from their depth values d_i and d_j and their spherical angle \theta_{ij} via the law of cosines:

C_{ij} = \sqrt{d_i^2 + d_j^2 - 2 d_i d_j \cos\theta_{ij}}    (1)
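A minimal sketch of this cost matrix, reusing the anchor directions and per-anchor depths from the representation sketch above (the function name is illustrative):

```python
import numpy as np

def geometric_cost_matrix(anchors, depths, eps=1e-8):
    """Pairwise geometric distances between anchor points following Eq. (1).

    anchors: (N, 3) unit directions of the anchor points.
    depths:  (N,)   depth of each anchor point.
    The result equals the Euclidean distance between the 3D points
    depths[i] * anchors[i] and depths[j] * anchors[j].
    """
    cos_theta = np.clip(anchors @ anchors.T, -1.0, 1.0)   # cos of spherical angle
    d_i, d_j = depths[:, None], depths[None, :]
    sq = d_i ** 2 + d_j ** 2 - 2.0 * d_i * d_j * cos_theta
    return np.sqrt(np.maximum(sq, eps))
```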

With the transportation plan matrix P and cost matrix C, GML can be formulated as the minimum total cost of transporting u to v:

\mathrm{GML}(u, v) = \min_{P \geq 0} \sum_{i,j} P_{ij} C_{ij}, \quad \text{s.t.} \;\; P\mathbf{1} = u, \;\; P^{\top}\mathbf{1} = v    (2)

Different from light distributions, the spherical depth values do not form a standard distribution (the sum of all depth values is not constant). We therefore introduce an unbalanced setting of GML for the regression of depth values, building on a relaxed version of the classical earth mover’s distance, namely the unbalanced earth mover’s distance chizat2016uot , which seeks an optimal transport plan between measures (here, spherical depth values) of different total masses. We formulate the unbalanced GML by replacing the ‘hard’ conservation of mass in (2) with a ‘soft’ penalty based on a divergence metric. The unbalanced GML, denoted by UGML, can thus be formulated as:

\mathrm{UGML}(u, v) = \min_{P \geq 0} \sum_{i,j} P_{ij} C_{ij} + \tau \, \mathrm{KL}(P\mathbf{1} \,\|\, u) + \tau \, \mathrm{KL}(P^{\top}\mathbf{1} \,\|\, v)    (3)

where \tau is a regularization parameter and \mathrm{KL}(\cdot \,\|\, \cdot) is the Kullback-Leibler divergence, defined for two non-negative vectors a and b as:

\mathrm{KL}(a \,\|\, b) = \sum_{i} a_i \log \frac{a_i}{b_i}    (4)

To solve the problem in an efficient and differentiable way, Cuturi cuturi2013sinkhorn introduced an entropic regularization term E(P) = \sum_{i,j} P_{ij} \log P_{ij}. With this term, the original problem (2) can be formulated as:

\mathrm{GML}_{\epsilon}(u, v) = \min_{P \geq 0} \sum_{i,j} P_{ij} C_{ij} + \epsilon \sum_{i,j} P_{ij} \log P_{ij}, \quad \text{s.t.} \;\; P\mathbf{1} = u, \;\; P^{\top}\mathbf{1} = v    (5)

where \epsilon is a regularization coefficient that controls the smoothness of the transportation plan matrix P. In our model, \epsilon is empirically set to 0.0001. The unbalanced GML (3) can be regularized similarly by adding the same entropic term. The regularized forms of problems (2) and (3) are then solved in an efficient and differentiable way by Sinkhorn iterations cuturi2013sinkhorn during network training.
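The balanced case can be sketched with a log-domain Sinkhorn solver as below; this is a minimal illustration (the function name and stabilization details are ours), and the unbalanced case of Eq. (3) would additionally soften the two marginal updates following chizat2016uot :

```python
import torch

def sinkhorn_gml(u, v, C, eps=1e-4, n_iters=100):
    """Entropy-regularized geometric mover's loss (balanced case), solved with
    log-domain Sinkhorn iterations so that a small eps stays numerically stable.

    u, v: (N,) non-negative histograms summing to one (light distributions).
    C:    (N, N) cost matrix, e.g. the geometric distances of Eq. (1).
    Returns the transport cost sum_ij P_ij * C_ij as a differentiable scalar.
    """
    log_u, log_v = torch.log(u + 1e-12), torch.log(v + 1e-12)
    f, g = torch.zeros_like(u), torch.zeros_like(v)    # dual potentials
    for _ in range(n_iters):
        # alternating c-transforms in the log domain
        f = eps * (log_u - torch.logsumexp((g[None, :] - C) / eps, dim=1))
        g = eps * (log_v - torch.logsumexp((f[:, None] - C) / eps, dim=0))
    # recover the transport plan and the (unregularized) transport cost
    P = torch.exp((f[:, None] + g[None, :] - C) / eps)
    return (P * C).sum()
```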

Using GML to regress the geometric distribution has two clear advantages. First, it makes the regression sensitive to the global geometry, thus effectively penalizing predicted activations that are far from the ground-truth distribution. Second, GML is smooth during training and provides stable optimization, which is beneficial for the under-constrained problem of illumination prediction.

3.3 Illumination Generation

With the predicted parameters, including the light distribution, light intensity, and ambient term, we can create a Gaussian map G using a spherical Gaussian function gardner2019deeppara as follows:

G(\omega) = \sum_{i=1}^{N} c_i \exp\!\left(\frac{\omega \cdot l_i - 1}{s}\right)    (6)

where G is the Gaussian map, N is the number of anchor points, and c_i denotes the RGB value of anchor point i, which is the product of the light distribution value at this anchor point and the light intensity. Further, l_i is the direction of anchor point i (pre-defined by Vogel’s method vogel1979 ), \omega is a unit vector giving a direction on the sphere, and s is the angular size (empirically set to 0.0025).
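A minimal sketch of this mapping, reusing panorama_directions from the representation sketch above; whether the ambient term is simply added to the map is our assumption:

```python
import numpy as np

def gaussian_map(distribution, intensity, ambient, anchors, h=128, w=256, s=0.0025):
    """Render the spherical-Gaussian guidance map of Eq. (6) as an
    equirectangular panorama of shape (h, w, 3)."""
    colors = distribution[:, None] * intensity[None, :]   # c_i = distribution_i * intensity
    omega = panorama_directions(h, w).reshape(-1, 3)      # (h*w, 3) pixel directions
    lobes = np.exp((omega @ anchors.T - 1.0) / s)         # exp((omega . l_i - 1) / s)
    gmap = lobes @ colors + ambient[None, :]              # ambient added as background (assumed)
    return gmap.reshape(h, w, 3)
```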

However, the constructed Gaussian map tends to lose the realistic illumination frequency information especially for high frequencies, which leads to weak shadow and shading effects as shown in Garon et al. garon2019fast . In contrast, the adversarial loss encourages the synthesis of high-frequency details as validated in sajjadi2017enhancenet . Thus, we propose a generative projector to synthesize a realistic illumination map through adversarial training. Specifically, we formulate the synthesis of the illumination map as a conditional image generation task with paired data as illustrated in Fig. 5, where the Gaussian maps serve as the conditional input.

Figure 6: Illustration of GMLight illumination estimation. For the Input Images in column 1, columns 2, 3, and 4 show the estimated Gaussian Maps (based on the regressed illumination parameters), Lighting Maps, and Intensity Maps, respectively. Columns 5 and 6 show the corresponding ground truth, i.e., Lighting Maps (GT) and Intensity Maps (GT), respectively.

The overall architecture of the generation network is similar to SPADE park2019spade , as shown in Fig. 5. Instead of sampling a random vector, we encode the input image into a latent feature vector for the adversarial generation. The illumination map is a panoramic image in which pixels at different latitudes are stretched by different amounts. As a result, vanilla convolution suffers from heavy distortions at different latitudes, especially around the polar regions of the panorama. To address this, SphereNet spherenet explicitly encodes invariance against latitude distortions into convolutional neural networks by adapting the sampling locations of the convolutional filters, which is known as spherical convolution. We therefore adopt spherical convolution (Spherical Conv) to generate the panoramic illumination map, effectively reversing the distortions by wrapping the filters around the sphere.

In EMLight, the same conditional Gaussian map is injected into the generation process at different stages. Ideally, the conditional input should provide coarse-to-fine guidance for low-to-high generation stages. Thus, we adopt different radii in the Gaussian function to generate coarse-to-fine Gaussian maps for different generation stages, as shown in Fig. 5. The conditional Gaussian maps are injected into the generation process at multiple stages through spatially-adaptive normalization park2019spade (denoted by ‘F’), as illustrated in Fig. 5. More details on the proposed generative projector are provided in the supplementary file.
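The progressive guidance can be sketched as below; the SPADE-style normalization block is simplified and the per-stage angular sizes are illustrative values, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADEBlock(nn.Module):
    """Simplified spatially-adaptive normalization park2019spade : the Gaussian
    map modulates normalized features with a spatially-varying scale and bias."""
    def __init__(self, channels, guide_channels=3, hidden=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(guide_channels, hidden, 3, padding=1), nn.ReLU(inplace=True))
        self.gamma = nn.Conv2d(hidden, channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, guide):
        guide = F.interpolate(guide, size=x.shape[-2:], mode="bilinear",
                              align_corners=False)
        h = self.shared(guide)
        return self.norm(x) * (1 + self.gamma(h)) + self.beta(h)

# Coarse-to-fine guidance: wider lobes (larger angular size s) for early,
# low-resolution stages, sharper lobes for later stages (values illustrative).
stage_angular_sizes = [0.04, 0.01, 0.0025]
# guides = [gaussian_map(dist, inten, amb, anchors, s=s) for s in stage_angular_sizes]
```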

Spatially-Varying Projection: Spatially-varying illumination prediction aims to recover the illumination at different positions in a scene from a single image. As there are no annotations for spatially-varying illumination in the Laval Indoor dataset, we are unable to train a spatially-varying model directly. Previous research gardner2019deeppara proposed to incorporate depth values into the projection to approximate the effect of spatially-varying illumination. We follow a similar idea to gardner2019deeppara to estimate spatially-varying illumination, as described below.

The Gaussian map is constructed through the spherical Gaussian function in Eq. (6). When the insertion position is moved by an offset \Delta x, the new direction of anchor point i becomes l'_i = (d_i l_i - \Delta x) / \|d_i l_i - \Delta x\|. The distances from the original and new insertion positions to anchor point i are d_i and d'_i = \|d_i l_i - \Delta x\|, respectively, which can be obtained from the predicted depth values of the anchor points. The light intensity contributed by anchor point i at the new insertion position can thus be approximated by scaling c_i with (d_i / d'_i)^2, and the Gaussian map of the new insertion position can be constructed as follows:

G'(\omega) = \sum_{i=1}^{N} c_i \left(\frac{d_i}{d'_i}\right)^{2} \exp\!\left(\frac{\omega \cdot l'_i - 1}{s}\right)    (7)

The Gaussian map is then fed into the generative projector to synthesize the final illumination map. Fig. 6 illustrates several samples of predicted Gaussian maps, generated illumination maps, visualized intensity maps and the corresponding ground truth. Fig. 10 illustrates the generated spatially-varying illumination maps at different insertion positions.
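A minimal sketch of this shifted projection, reusing the earlier helpers; note that the inverse-square intensity falloff is our assumption about the exact scaling in Eq. (7):

```python
import numpy as np

def shifted_gaussian_map(distribution, intensity, ambient, anchors, depths,
                         offset, h=128, w=256, s=0.0025):
    """Approximate the Gaussian map at a shifted insertion position (Eq. (7)):
    re-aim each anchor lobe from the new position and rescale its contribution
    by an inverse-square distance ratio (assumed falloff)."""
    points = depths[:, None] * anchors                 # 3D positions of anchor points
    rel = points - np.asarray(offset)[None, :]         # vectors from the new position
    new_depths = np.linalg.norm(rel, axis=1) + 1e-8    # d'_i
    new_dirs = rel / new_depths[:, None]               # l'_i
    scale = (depths / new_depths) ** 2                 # intensity rescaling (assumed)

    colors = distribution[:, None] * intensity[None, :] * scale[:, None]
    omega = panorama_directions(h, w).reshape(-1, 3)   # reuse the earlier helper
    gmap = np.exp((omega @ new_dirs.T - 1.0) / s) @ colors + ambient[None, :]
    return gmap.reshape(h, w, 3)
```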

Metrics RMSE si-RMSE Angular Error User Study GMD
D S M D S M D S M D S M N/A
Gardner et al. gardner2017 0.146 0.173 0.202 0.142 0.151 0.174 8.12 8.37 8.81 28.0% 23.0% 20.5% 6.842
Gardner et al. gardner2019deeppara 0.084 0.112 0.147 0.073 0.093 0.119 6.82 7.15 7.22 33.5% 28.0% 24.5% 5.524
Li et al. li2019spherical 0.203 0.218 0.257 0.193 0.212 0.243 9.37 9.51 9.81 25.0% 21.5% 17.5% 7.013
Garon et al. garon2019fast 0.181 0.207 0.249 0.177 0.196 0.221 9.12 9.32 9.49 27.0% 22.5% 19.0% 7.137
Zhan et al. zhan2021emlight 0.062 0.071 0.089 0.043 0.054 0.078 6.43 6.61 6.95 40.0% 35.0% 25.0% 5.131
GMLight 0.051 0.064 0.078 0.037 0.049 0.074 6.21 6.50 6.77 42.0% 35.5% 31.0% 4.892
Table 1: Comparison of GMLight with several state-of-the-art lighting estimation methods. The evaluation metrics include the widely used RMSE, si-RMSE, angular error, a user study, and GMD. D, S, and M denote diffuse, matte silver, and mirror materials of the rendered objects, respectively.

Loss Functions: The generative projector employs several losses to drive the generation of high-quality illumination maps. We denote the input Gaussian map as G, the ground-truth illumination map as Y, and the generated illumination map as \hat{Y}. To stabilize the training, we introduce a feature matching loss that matches the intermediate features of the discriminator between the generated illumination map and the ground truth:

\mathcal{L}_{fm} = \sum_{k} \lambda_k \left\| D_k(\hat{Y}) - D_k(Y) \right\|_1    (8)

where D_k(\cdot) represents the activation of the k-th layer of the discriminator and \lambda_k denotes the balancing coefficients. To obtain a similar illumination distribution instead of excessively emphasizing the absolute intensity, a cosine similarity is computed between the generated illumination map and the ground truth:

\mathcal{L}_{cos} = \mu \left( 1 - \frac{\hat{Y} \cdot Y}{\|\hat{Y}\| \, \|Y\|} \right)    (9)

where \mu is the weight of this term. The discriminator adopts the same architecture as PatchGAN isola2017pixel2pixel , which yields the adversarial loss \mathcal{L}_{adv}. The generative projector is then optimized under the following objective:

\mathcal{L} = \mathcal{L}_{adv} + \mathcal{L}_{fm} + \mathcal{L}_{cos}    (10)

As the regression network and generative projector are both differentiable, the whole framework can be optimized end-to-end.
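A minimal PyTorch sketch of the generator-side objective of Eqs. (8)-(10); the hinge-style adversarial term and the L1 feature distance are assumptions, not the paper's exact choices:

```python
import torch
import torch.nn.functional as F

def generator_losses(disc_feats_fake, disc_feats_real, logits_fake,
                     fake_map, real_map, lambdas=None, mu=1.0):
    """Combine adversarial, feature-matching (Eq. 8) and cosine (Eq. 9) terms
    into the overall objective of Eq. (10)."""
    if lambdas is None:
        lambdas = [1.0] * len(disc_feats_fake)

    # Eq. (8): match intermediate discriminator features of fake and real maps.
    l_fm = sum(w * F.l1_loss(f, r.detach())
               for w, f, r in zip(lambdas, disc_feats_fake, disc_feats_real))

    # Eq. (9): penalize dissimilar illumination distributions, not absolute scale.
    cos = F.cosine_similarity(fake_map.flatten(1), real_map.flatten(1), dim=1)
    l_cos = mu * (1.0 - cos).mean()

    # Generator-side adversarial term from the PatchGAN logits (hinge-style).
    l_adv = -logits_fake.mean()

    # Eq. (10): overall objective.
    return l_adv + l_fm + l_cos
```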

4 Experiments

4.1 Datasets and Experimental Settings

We evaluate GMLight on the Laval Indoor HDR dataset gardner2017 that consists of 2,100 HDR panoramas taken in a variety of indoor environments. Similar to gardner2017 , we crop eight images with limited fields of view from each panorama, which produces 19,556 training pairs for our experiments. We apply the same image warping as in gardner2017 to each image. In our experiments, we randomly select 200 images as the testing set, and use the rest for training. In addition to the Laval Indoor HDR dataset, we also qualitatively evaluate GMLight on the dataset introduced in garon2019fast (https://lvsn.github.io/fastindoorlight/).

Following gardner2019deeppara and garon2019fast , we use DenseNet-121 huang2017densely as the backbone of the regression network. The detailed network structure of the generative projector is provided in the supplementary file. We implemented GMLight in PyTorch and adopted the Adam optimizer kingma2014adam with a learning rate decay schedule (initial learning rate 0.001). The network is trained on two NVIDIA Tesla P100 GPUs with a batch size of 4 for 100 epochs.
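The regression network can be sketched as follows; the head dimensions and output activations are our assumptions (the paper only specifies the DenseNet-121 backbone and four FC branches):

```python
import torch
import torch.nn as nn
import torchvision

class LightingRegressor(nn.Module):
    """DenseNet-121 backbone with four fully-connected heads for light
    distribution, light intensity, ambient term and per-anchor depth."""
    def __init__(self, n_anchors=128):
        super().__init__()
        backbone = torchvision.models.densenet121()
        self.features = backbone.features                 # convolutional trunk
        feat_dim = backbone.classifier.in_features        # 1024 for DenseNet-121
        self.distribution = nn.Linear(feat_dim, n_anchors)
        self.intensity = nn.Linear(feat_dim, 3)            # RGB light intensity
        self.ambient = nn.Linear(feat_dim, 3)              # RGB ambient term
        self.depth = nn.Linear(feat_dim, n_anchors)        # depth per anchor point

    def forward(self, x):
        f = torch.relu(self.features(x)).mean(dim=[2, 3])  # global average pooling
        return {
            "distribution": torch.softmax(self.distribution(f), dim=1),  # sums to one
            "intensity": torch.relu(self.intensity(f)),
            "ambient": torch.relu(self.ambient(f)),
            "depth": torch.relu(self.depth(f)),
        }
```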

Figure 7: The evaluation scene contains spheres of three types of materials: diffuse gray, matte silver, and mirror silver.
Figure 8: Qualitative comparison of GMLight with state-of-the-art methods: With illumination maps predicted by different methods (shown at the top-left corner of each rendered image), the rendered objects demonstrate different light intensities, colors, shadows, and shades.

4.2 Evaluation Method and Metrics

Similar to the evaluation settings in Legendre et al. legendre2019deeplight , our scene for model evaluation includes three spheres made of gray diffuse, matte silver, and mirror material, as illustrated in Fig. 7. Performance is evaluated by comparing the scene images rendered (by Blender blender ) with the predicted illumination and with the ground-truth illumination. The evaluation metrics include the root mean square error (RMSE) and the scale-invariant RMSE (si-RMSE), which focus on the estimated light intensity and light directions (or shading), respectively. Both metrics have been widely adopted in the evaluation of illumination prediction. We also adopt the per-pixel linear RGB angular error legendre2019deeplight and conduct a crowdsourced user study on Amazon Mechanical Turk (AMT) to subjectively assess the realism of the rendered images. In the experiments, each compared model predicts 200 illumination maps on the test set for quantitative evaluation. For qualitative evaluation, we design 25 different scenes for 3D insertion and render them with the predicted illumination maps.
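For reference, the image-space metrics can be computed roughly as below; the least-squares rescaling in si-RMSE is one common definition and an assumption about the paper's exact formulation:

```python
import numpy as np

def rmse(pred, gt):
    """Root mean square error between two rendered images."""
    return np.sqrt(np.mean((pred - gt) ** 2))

def si_rmse(pred, gt):
    """Scale-invariant RMSE: rescale the prediction by the least-squares
    optimal scalar before computing RMSE, so only relative shading matters."""
    alpha = (pred * gt).sum() / max((pred * pred).sum(), 1e-12)
    return rmse(alpha * pred, gt)

def angular_error_deg(pred, gt, eps=1e-12):
    """Per-pixel angular error (in degrees) between linear RGB vectors,
    averaged over the image."""
    p, g = pred.reshape(-1, 3), gt.reshape(-1, 3)
    cos = (p * g).sum(axis=1) / (np.linalg.norm(p, axis=1) *
                                 np.linalg.norm(g, axis=1) + eps)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()
```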

Besides, we introduce a Geometric Mover’s Distance (GMD) based on the geometric mover’s loss in (2) to measure the discrepancy between the light distributions of illumination maps:

\mathrm{GMD}(\hat{Y}, Y) = \mathrm{GML}(\hat{Y}_n, Y_n)    (11)

where \hat{Y}_n and Y_n are the normalized generated and ground-truth illumination maps, whose pixels form geometric distributions. The GMD metric is sensitive to the scene geometry through the cost matrix C, thus providing a more accurate evaluation of the illumination distribution (or directions) than si-RMSE.
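Reusing the earlier sketches (panorama_directions, geometric_cost_matrix, sinkhorn_gml), GMD could be approximated as below; binning pixels onto the anchor points is our simplification to keep the cost matrix small:

```python
import numpy as np
import torch

def gmd(pred_map, gt_map, anchors, depths):
    """Geometric Mover's Distance of Eq. (11): normalize each illumination map
    into a discrete distribution over the anchor points, then compare the two
    distributions with the Sinkhorn-based GML defined above."""
    h, w, _ = gt_map.shape
    dirs = panorama_directions(h, w).reshape(-1, 3)
    nearest = np.argmax(dirs @ anchors.T, axis=1)          # pixel -> anchor bin

    def to_distribution(hdr):
        energy = hdr.mean(axis=-1).reshape(-1)
        hist = np.bincount(nearest, weights=energy, minlength=anchors.shape[0])
        return torch.as_tensor(hist / max(hist.sum(), 1e-8), dtype=torch.float32)

    C = torch.as_tensor(geometric_cost_matrix(anchors, depths), dtype=torch.float32)
    return sinkhorn_gml(to_distribution(pred_map), to_distribution(gt_map), C).item()
```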

4.3 Quantitative Evaluation

We compare GMLight with EMLight zhan2021emlight and several other state-of-the-art methods that either generate illumination maps directly gardner2017 or estimate representative illumination parameters garon2019fast ; li2019spherical ; gardner2019deeppara . For each compared method, we render 200 images of the testing scene (three spheres made of diffuse, matte silver, and mirror silver materials) using the illumination predicted on the test set of the Laval Indoor dataset. Table 1 shows the experimental results, where D, S, and M denote diffuse, matte silver, and mirror material objects, respectively. The AMT user study was conducted by showing 20 users pairs of images rendered with the ground-truth illumination and with each compared method, and asking them to pick the more realistic one. The score is the percentage of rendered images that are deemed more realistic than the ground-truth rendering.

As can be observed, GMLight consistently outperforms all compared methods under different evaluation metrics and materials. EMLight zhan2021emlight simplifies the lighting distribution of scenes to be spherical, ignoring the complex scene geometry. GMLight introduces depth to model scene geometry which leads to more accurate illumination estimation. gardner2017 generates illumination maps directly, but it tends to overfit training data due to the unconstrained nature of illumination estimation from a single image. gardner2019deeppara regresses the spherical Gaussian parameters of light sources, but it often loses useful frequency information and generates inaccurate shading and shadows. li2019spherical adopts spherical Gaussian functions to reconstruct the illumination maps in the spatial domain but it often loses high-frequency illumination. garon2019fast recovers lighting by regressing spherical harmonic coefficients, but their model struggles to regress lighting directions and recover high-frequency information. Although garon2019fast adopts a masked L2 loss to preserve high-frequency information, it does not fully solve the problem as illustrated in Fig. 8. In contrast, GMLight estimates illumination parameters by regressing the light distribution under a geometric mover’s loss. With the estimated parameters, the generative projector generates accurate and high-fidelity illumination maps with realistic frequency information via adversarial training.

Figure 9: Object relighting over sample images from the Internet. For sample images in Column 1, our GMLight estimates lighting and illumination maps (highlighted by red boxes) automatically and applies them to relight virtual objects realistically as shown in Column 2.
Models RMSE si-RMSE Angular Error User Study GMD
D S M D S M D S M D S M N/A
SG+L2 0.204 0.213 0.238 0.188 0.203 0.229 9.18 9.42 9.73 26.0% 22.5% 18.0% 5.631
GD+L2 0.133 0.161 0.178 0.117 0.132 0.161 7.60 7.88 8.12 30.5% 25.5% 22.0% 5.303
GD+SML 0.080 0.103 0.117 0.072 0.087 0.106 6.78 6.98 7.12 34.0% 31.5% 26.0% 5.163
GD+GML 0.073 0.091 0.102 0.062 0.069 0.092 6.61 6.85 7.04 35.5% 32.0% 25.5% 5.031
GD+GML+GP 0.051 0.064 0.078 0.037 0.049 0.074 6.21 6.50 6.77 42.0% 35.5% 31.0% 4.892
Table 2: Ablation study of the proposed GMLight. SG and GD denote spherical Gaussian representation and our proposed geometric distribution representation of illumination maps. L2 and GML denote L2 loss and geometric mover’s loss that are used in the regression of lighting parameters. GP denotes our proposed generative projector.

4.4 Qualitative Evaluation

We visualize our predicted Gaussian maps, generated illumination maps, and the corresponding intensity maps in Fig. 6. As can be seen, our regression network accurately predicts light distributions, as shown in the Gaussian Maps. The generative projector then generates accurate and realistic HDR illumination maps, as shown in the generated Lighting Maps. To further verify the quality of the generated HDR illumination, we also visualize the intensity maps of the illumination maps.

We qualitatively compare GMLight with four state-of-the-art light estimation methods. Fig. 8 shows the images rendered with the predicted illumination maps (highlighted by red boxes). As can be observed, GMLight predicts realistic illumination maps with plausible light sources and produces realistic renderings with clear and accurate shades and shadows. In contrast, direct generation gardner2017 struggles to identify the direction of light sources as there is no guidance for the generation. Illumination maps produced by Gardner et al. gardner2019deeppara are over-simplified with a limited number of light sources. This simplification loses accurate frequency information which results in unrealistic shadow and shading in the renderings. Garon et al. garon2019fast and Li et al. li2019spherical regress illumination parameters but are often constrained by the order of the representative functions (spherical harmonic and spherical Gaussian). As a result, they predict low-frequency illumination and produce renderings with inaccurate shades and shadows as illustrated in Fig. 8.

We provide more object relighting examples using images from the Internet in Fig. 9. It is clear that our model can predict illumination maps from background images reliably, which can be used to render realistic objects with an off-the-shelf renderer (e.g., Blender).

Methods RMSE si-RMSE
D S M D S M
Anchor=64 0.071 0.085 0.102 0.064 0.071 0.093
Anchor=196 0.053 0.062 0.081 0.036 0.050 0.075
W/o GML 0.062 0.074 0.094 0.055 0.059 0.078
W/o SConv 0.056 0.069 0.082 0.044 0.054 0.083
W/o Adp Radius 0.063 0.071 0.085 0.047 0.056 0.081
GMLight 0.051 0.064 0.078 0.037 0.049 0.074
Table 3: Ablation studies over anchor points, loss functions and convolution operators: GMLight denotes the standard setting with 128 anchor points, geometric mover’s loss (GML), and spherical convolution. We create five GMLight variants that use 64 and 196 anchor points, replace GML with cross-entropy loss (W/o GML), replace spherical convolution with standard convolution (W/o SConv), and employ a fixed radius in the Gaussian function (W/o Adp Radius), respectively (other conditions unchanged).

4.5 Ablation Study

We developed several GMLight variants as listed in Table 2 to evaluate the effectiveness of our proposed method. The variants include 1) SG+L2 (baseline) that regresses spherical Gaussian parameters with L2 loss as in gardner2017 ; 2) GD+L2 that regresses geometric distribution of illumination with L2 loss; 3) GD+SML that regresses geometric distribution of illumination with SML proposed in EMLight; 4) GD+GML that regresses geometric distribution of illumination with GML; and 5) GD+GML+GP (standard GMLight). Similar to the setting in Quantitative Evaluation, we apply all variant models to render 200 images of the testing scene. As Table 2 shows, GD+L2 outperforms SG+L2 clearly, demonstrating the superiority of geometric distributions in representing illuminations. GD+GML also produces better estimation than GD+L2 and GD+SML, validating the effectiveness of our designed GML. GD+GML+GP performs the best, demonstrating that the generative projector improves illumination predictions clearly.

We also benchmark the geometric mover’s loss (GML) against the widely adopted cross-entropy loss for distribution regression, compare spherical convolution with vanilla convolution, and study how the number of anchor points affects lighting estimation. We followed the experimental setting in Table 2, and Table 3 shows the RMSE and si-RMSE results on the three materials. We can see that GML outperforms the cross-entropy loss clearly, as GML captures the spatial information of geometric distributions effectively. In addition, spherical convolution performs consistently better than vanilla convolution in panoramic image generation. Further, the prediction performance drops slightly when 64 instead of 128 anchor points are used, while increasing the number of anchor points to 196 does not improve the performance noticeably. We conjecture that the larger number of parameters with 196 anchor points affects the regression accuracy negatively.

Figure 10: Estimated spatially-varying illumination maps at different insertion positions (center, left, right, up, and down) by GMLight on Laval Indoor dataset gardner2017 .
Figure 11: Illustration of spatially-varying illumination estimation by GMLight: Images in column 1 are background images garon2019fast for 3D object insertion, where the silver sphere in each image shows real illuminations at one specific position. Images in column 2 show object relighting at different insertion positions. With GMLight-estimated illuminations, the relighting at different positions is well aligned with that of the silver spheres.
Methods RMSE si-RMSE
Left Center Right Left Center Right
Gardner et al. gardner2017 0.168 0.176 0.171 0.148 0.159 0.152
Gardner et al. gardner2019deeppara 0.102 0.114 0.104 0.085 0.097 0.087
Garon et al. garon2019fast 0.186 0.199 0.182 0.174 0.184 0.173
GMLight 0.059 0.066 0.058 0.043 0.051 0.041
Table 4: Quantitative comparison of spatially-varying illumination estimated by GMLight and state-of-the-art methods: Evaluations were performed for three insertion positions (left, center, and right) over images from the Laval Indoor dataset gardner2017 . The scores of the three spheres in Fig. 7 are averaged to obtain the final score.

4.6 Spatially-varying Illumination

Spatially-varying illumination prediction aims to recover the illumination at different positions of a scene. Fig. 10 shows the spatially-varying illumination maps predicted at different insertion positions (center, left, right, up, and down) by GMLight. It can be seen that GMLight estimates the illumination maps of different insertion positions nicely, largely due to the auxiliary depth branch that estimates scene depths and recovers the scene geometry effectively. We also evaluate spatially-varying illumination estimation quantitatively, and Table 4 shows the experimental results. GMLight outperforms the other methods consistently at all insertion positions. The superior performance is largely attributed to the accurate geometric modeling of lighting distributions with scene depths.

Fig. 11 illustrates 3D insertion results with the estimated spatially-varying illumination. The sample images are from garon2019fast , where a silver sphere indicates the spatially-varying illumination at different scene positions and serves as a reference for evaluating the realism of 3D insertion. As Fig. 11 shows, the inserted objects (clocks) at different positions present shading and shadow effects consistent with the silver spheres, demonstrating the high quality of the spatially-varying illumination estimated by GMLight.

5 Conclusion

This paper presents GMLight, a lighting estimation framework that consists of a regression network and a generative projector. In GMLight, we formulate illumination prediction as a discrete distribution regression problem within a geometric space and design a geometric mover’s loss to achieve effective regression of the geometric light distribution. To generate accurate illumination maps with realistic frequency (especially high frequency), we introduce a novel generative projector with progressive guidance that synthesizes panoramic illumination maps through adversarial training. Quantitative and qualitative experiments show that GMLight is capable of predicting illumination accurately from a single indoor image. We will continue to investigate illumination estimation from the perspective of geometric distributions in our future work.

References

  • (1) Barron, J.T., Malik, J.: Intrinsic scene properties from a single rgb-d image. TPAMI (2015)
  • (2) Boss, M., Jampani, V., Kim, K., Lensch, H.P., Kautz, J.: Two-shot spatially-varying brdf and shape estimation. In: CVPR (2020)
  • (3) Chen, Z., Chen, A., Zhang, G., Wang, C., Ji, Y., Kutulakos, K.N., Yu, J.: A neural rendering framework for free-viewpoint relighting. arXiv:1911.11530 (2019)
  • (4) Cheng, D., Shi, J., Chen, Y., Deng, X., Zhang, X.: Learning scene illumination by pairwise photos from rear and front mobile cameras. Computer Graphics Forum (2018)
  • (5) Chizat, L., Peyré, G., Schmitzer, B., Vialard, F.X.: Scaling algorithms for unbalanced transport problems. In: arXiv:1607.05816 (2016)
  • (6) Coors, B., Condurache, A.P., Geiger, A.: Spherenet: Learning spherical representations for detection and classification in omnidirectional images. In: ECCV (2018)
  • (7) Cuturi, M.: Sinkhorn distances: Lightspeed computation of optimal transport. In: NIPS (2013)
  • (8) Gardner, M.A., Hold-Geoffroy, Y., Sunkavalli, K., Gagné, C., Lalonde, J.F.: Deep parametric indoor lighting estimation. In: ICCV (2019)
  • (9) Gardner, M.A., Sunkavalli, K., Yumer, E., Shen, X., Gambaretto, E., Gagné, C., Lalonde, J.F.: Learning to predict indoor illumination from a single image. In: SIGGRAPH Asia (2017)
  • (10) Garon, M., Sunkavalli, K., Hadap, S., Carr, N., Lalonde, J.F.: Fast spatially-varying indoor lighting estimation. In: CVPR (2019)
  • (11) Gupta, M., Tian, Y., Narasimhan, S.G., Zhang, L.: A combined theory of defocused illumination and global light transport. International Journal of Computer Vision 98(2), 146–167 (2012)
  • (12) Hess, R.: Blender Foundations: The Essential Guide to Learning Blender 2.6. Focal Press (2010)
  • (13) Hold-Geoffroy, Y., Sunkavalli, K., Hadap, S., Gambaretto, E., Lalonde, J.F.: Deep outdoor illumination estimation. In: CVPR (2017)
  • (14) Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708 (2017)

  • (15) Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR (2017)
  • (16) Karsch, K., Hedau, V., Forsyth, D., Hoiem, D.: Rendering synthetic objects into legacy photographs. TOG (2011)
  • (17) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • (18) Lalonde, J.F., Efros, A.A., Narasimhan, S.G.: Estimating the natural illumination conditions from a single outdoor image. IJCV (2012)
  • (19) LeGendre, C., Ma, W.C., Fyffe, G., Flynn, J., Charbonnel, L., Busch, J., Debevec, P.: Deeplight: Learning illumination for unconstrained mobile mixed reality. In: CVPR (2019)
  • (20) Li, M., Guo, J., Cui, X., Pan, R., Guo, Y., Wang, C., Yu, P., Pan, F.: Deep spherical gaussian illumination estimation for indoor scene. In: MM Asia (2019)
  • (21) Li, Z., Shafiei, M., Ramamoorthi, R., Sunkavalli, K., Chandraker, M.: Inverse rendering for complex indoor scenes shape, spatially varying lighting and svbrdf from a single image. In: CVPR (2020)
  • (22) Liao, Z., Karsch, K., Zhang, H., Forsyth, D.: An approximate shading model with detail decomposition for object relighting. International Journal of Computer Vision 127(1), 22–37 (2019)
  • (23) Liu, D., Long, C., Zhang, H., Yu, H., Dong, X., Xiao, C.: Arshadowgan: Shadow generative adversarial network for augmented reality in single light scenes. In: CVPR (2020)
  • (24) Lombardi, S., Nishino, K.: Reflectance and illumination recovery in the wild. TPAMI (2016)
  • (25) Maier, R., Kim, K., Cremers, D., Kautz, J., Nießner, M.: Intrinsic3d: High-quality 3d reconstruction by joint appearance and geometry optimization with spatially-varying lighting. In: ICCV (2017)
  • (26) Maurer, D., Ju, Y.C., Breuß, M., Bruhn, A.: Combining shape from shading and stereo: A joint variational method for estimating depth, illumination and albedo. International Journal of Computer Vision 126(12), 1342–1366 (2018)
  • (27) Murmann, L., Gharbi, M., Aittala, M., Durand, F.: A dataset of multi-illumination images in the wild. In: ICCV (2019)
  • (28) Ngo, T.T., Nagahara, H., Nishino, K., Taniguchi, R.i., Yagi, Y.: Reflectance and shape estimation with a light field camera under natural illumination. International Journal of Computer Vision 127(11), 1707–1722 (2019)
  • (29) Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with spatially-adaptive normalization. In: CVPR (2019)
  • (30) Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover’s distance as a metric for image retrieval. IJCV (2000)
  • (31) Sajjadi, M.S., Scholkopf, B., Hirsch, M.: Enhancenet: Single image super-resolution through automated texture synthesis. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4491–4500 (2017)
  • (32) Sengupta, S., Gu, J., Kim, K., Liu, G., Jacobs, D.W., Kautz, J.: Neural inverse rendering of an indoor scene from a single image. In: ICCV (2019)
  • (33) Song, S., Funkhouser, T.: Neural illumination: Lighting prediction for indoor environments. In: CVPR (2019)
  • (34) Srinivasan, P.P., Mildenhall, B., Tancik, M., Barron, J.T., Tucker, R., Snavely, N.: Lighthouse: Predicting lighting volumes for spatially-coherent illumination. In: CVPR (2020)
  • (35) Sun, T., Barron, J.T., Tsai, Y.T., Xu, Z., Yu, X., Fyffe, G., Rhemann, C., Busch, J., Debevec, P., Ramamoorthi, R.: Single image portrait relighting. In: TOG (2019)
  • (36) Vogel, H.: A better way to construct the sunflower head. Mathematical biosciences (1979)
  • (37) Xue, C., Lu, S., Zhan, F.: Accurate scene text detection through border semantics awareness and bootstrapping. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 355–372 (2018)
  • (38) Zhan, F., Huang, J., Lu, S.: Adaptive composition gan towards realistic image synthesis. arXiv preprint arXiv:1905.04693 (2019)
  • (39) Zhan, F., Lu, S.: Esir: End-to-end scene text recognition via iterative image rectification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2059–2068 (2019)
  • (40) Zhan, F., Lu, S., Xue, C.: Verisimilar image synthesis for accurate detection and recognition of texts in scenes. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 249–266 (2018)
  • (41) Zhan, F., Lu, S., Zhang, C., Ma, F., Xie, X.: Adversarial image composition with auxiliary illumination. In: Proceedings of the Asian Conference on Computer Vision (2020)
  • (42) Zhan, F., Lu, S., Zhang, C., Ma, F., Xie, X.: Towards realistic 3d embedding via view alignment. arXiv preprint arXiv:2007.07066 (2020)
  • (43) Zhan, F., Xue, C., Lu, S.: Ga-dan: Geometry-aware domain adaptation network for scene text detection and recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9105–9115 (2019)
  • (44) Zhan, F., Zhang, C.: Spatial-aware gan for unsupervised person re-identification. Proceedings of the International Conference on Pattern Recognition (2020)
  • (45) Zhan, F., Zhang, C., Yu, Y., Chang, Y., Lu, S., Ma, F., Xie, X.: Emlight: Lighting estimation via spherical distribution approximation. arXiv preprint arXiv:2012.11116 (2020)
  • (46) Zhan, F., Zhu, H., Lu, S.: Scene text synthesis for efficient and effective deep network training. arXiv preprint arXiv:1901.09193 (2019)
  • (47) Zhan, F., Zhu, H., Lu, S.: Spatial fusion gan for image synthesis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3653–3662 (2019)
  • (48) Zhang, E., Cohen, M.F., Curless, B.: Emptying, refurnishing, and relighting indoor spaces. TOG (2016)
  • (49) Zoran, D., Krishnan, D., Bento, J., Freeman, B.: Shape and illumination from shading using the generic viewpoint assumption. In: NIPS (2014)