1 Introduction
Lighting estimation aims to recover panoramic illumination from a single image with a limited field of view. It has a wide range of applications in computer vision and computer graphics, such as object relighting in mixed reality. However, lighting estimation is a typical under-constrained problem, since a full-view illumination map must be recovered from an image that covers only a limited field of view. In addition, high-dynamic-range (HDR) illumination needs to be inferred from low-dynamic-range (LDR) observations for realistic object relighting.
Lighting estimation has been investigated by direct generation of illumination maps gardner2017 ; song2019 ; srinivasan2019lighthouse or by regression of representation parameters, such as spherical harmonic parameters cheng2018shlight ; garon2019fast and spherical Gaussian parameters gardner2019deeppara ; li2020rendering . However, representation-based methods often struggle to accurately regress frequency information (especially high-frequency information), which often leads to inaccurate shading and shadow effects in object relighting garon2019fast ; li2019spherical . Meanwhile, direct generation methods enable certain high-frequency information to be regressed, but often generalize poorly gardner2017 ; chen2019neural .
In our earlier work zhan2021emlight , we designed a network named EMLight to model lighting with spherical distributions, and introduced a spherical mover’s loss to regress the spherical distribution parameters. However, EMLight employs a simplified spherical surface to represent the geometric space when modeling a lighting distribution, ignoring the depth variation present in real-world illumination scenes. As a result, it often suffers from low geometric accuracy in lighting estimation and cannot handle spatially-varying illumination. In this work, we propose Geometric Mover’s Light (GMLight), which models illumination with geometric distributions that are closer to real-world illumination distributions than the simplified spherical distribution zhan2021emlight and thus provide higher accuracy. Specifically, we introduce auxiliary depths to capture scene geometry, which enables effective estimation of spatially-varying illumination. Furthermore, we design a progressive Gaussian guidance scheme for lighting generation which significantly outperforms the fixed Gaussian guidance in EMLight.
GMLight represents lighting by discrete distributions in a geometric space (defined by scene depths as illustrated in Fig. 1) and formulates lighting estimation as a distribution regression task. It consists of a regression network for lighting parameter prediction and a generative projector for illumination map synthesis. For lighting parameter estimation, we propose a distribution-based representation that decomposes an illumination scene into four components: depth values, light distribution, light intensity, and ambient term. The last two are scalars and can be directly regressed with a naive L2 loss. Light distributions and scene depths, in contrast, are spatially localized in scenes, and are therefore not suitable for regression with the naive L2 loss, which does not capture geometry information. Inspired by the earth mover’s distance emd between distributions, we design a Geometric Mover’s Loss (GML) that regresses light distributions and scene depths with an ‘earth mover’ in a geometric space. GML searches for an optimal plan to move one distribution to another with minimal total moving distance, so it can effectively capture spatial information when regressing light distributions and scene depths.
With the estimated illumination scene parameters, the generative projector generates illumination maps with realistic frequency and appearance in an adversarial manner. In detail, a spherical Gaussian function gardner2019deeppara is adopted to map the estimated light parameters into a Gaussian map (a panoramic image), which serves as guidance in illumination map generation. The Gaussian map can be constructed at each position in a scene (for spatially-varying illumination) with knowledge of the scene geometry as reflected by the depth values. To provide effective guidance at different generation stages, we adopt Gaussian functions of different radii to generate Gaussian maps in a coarse-to-fine manner. Since illumination maps are panoramic images that usually suffer from spherical distortions at different latitudes, we adopt spherical convolutions spherenet to accurately generate panoramic illumination maps. More details are provided in the following sections.
2 Related Works
Lighting estimation is a classic challenge in computer vision and computer graphics, and it is critical for realistic relighting during object insertion lalonde2012 ; geoffroy2017 ; murmann2019dataset ; boss2020 ; ngo2019reflectance ; liao2019approximate ; maurer2018combining ; gupta2012combined and image synthesis zhan2019sfgan ; zhan2018verisimilar ; zhan2019esir ; zhan2019gadan ; zhan2019acgan ; zhan2019scene ; zhan2020towards ; zhan2020sagan ; xue2018accurate . Traditional approaches require user intervention or assumptions about the underlying illumination model, scene geometry, etc. For example, karsch2011 recovers parametric 3D lighting from a single image but requires user annotations for initial lighting and geometry estimates. zhang2016 requires a full multi-view 3D reconstruction of scenes. lombardi2016 estimates illumination from an object of known shape with a low-dimensional model. Maier et al. maier2017 make use of additional depth information to recover spherical harmonics (SH) illumination. Sengupta et al. sengupta2019 jointly estimate the albedo, normals, and lighting of an indoor scene from a single image. Barron et al. barron2015 estimate shape, lighting, and material but rely on data-driven priors to compensate for the lack of geometry information. Zoran et al. zoran2014 effectively estimate illumination from a single image by leveraging a generic viewpoint.
On the other hand, more recent works aim to estimate lighting from images by regressing representation parameters cheng2018shlight ; gardner2019deeppara ; li2020rendering or generating illumination maps gardner2017 ; song2019 . Cheng et al. cheng2018shlight regress the SH parameters of global lighting with a rendering loss. Gardner et al. gardner2017 generate illumination maps directly with a two-step training strategy. Gardner et al. gardner2019deeppara estimate the positions, intensities, and colors of light sources and reconstruct illumination maps with a spherical Gaussian function. Garon et al. garon2019fast estimate lighting by predicting SH coefficients from a background image and a local patch. Song et al. song2019 estimate per-pixel 3D geometry and use a convolutional network to predict unobserved content in the environment map. On top of this, Li et al. li2019spherical represent illumination maps with multiple spherical Gaussian functions and regress the corresponding Gaussian parameters for lighting estimation. Legendre et al. legendre2019deeplight regress HDR lighting from LDR images by comparing the sphere rendered with the predicted illumination to the ground truth. Srinivasan et al. srinivasan2019lighthouse estimate a 3D volumetric RGB model of a scene and use standard volume rendering to estimate incident illumination. Given any illumination map, the framework proposed by sun2019 is able to relight an RGB portrait image taken in an unconstrained environment. Besides, several works liu2020shadow ; zhan2020aicnet adopt generative adversarial networks to generate shadows without explicitly estimating the illumination map.
However, the aforementioned works either lose realistic frequency information or produce inaccurate light sources during illumination estimation. In contrast, we decompose illumination maps into four components and design a regression network with a geometric mover’s loss to estimate the decomposed illumination parameters with accurate spatial information. Using the estimated illumination parameters as guidance, our designed generative projector generates illumination maps accurately with realistic frequency information.
3 Proposed Method
GMLight consists of two sequential modules, a regression network and a generative projector, as illustrated in Figs. 2 and 5. We propose a distribution-based representation to parameterize illumination scenes with four sets of parameters, namely the light distribution, light intensity, ambient term, and depth, which are to be estimated by the regression network. The illumination scene parameters estimated by the regression network serve as guidance for the generative projector to synthesize realistic illumination maps.
3.1 Distribution-Based Illumination Representation
Fig. 2 illustrates our proposed distribution-based illumination representation. Given an image, we first separate the light sources from its HDR illumination map, since the light sources in a scene are most critical for illumination prediction. Following gardner2019deeppara , we take the light sources to be the top 5% of pixels with the highest values in the HDR illumination map. The light intensity and the ambient term are then determined by the sum of all light-source pixels and the average of the ambient-region pixels, respectively. As in vogel1979 , we generate N (N = 128 by default in this work) anchor points that are uniformly distributed on a unit sphere, as illustrated in Fig. 3. Each light-source pixel is assigned to its nearest anchor point based on the minimum radian distance, and the light-source value of an anchor point is the sum of all its affiliated pixels. Afterward, the light-source values are normalized by the intensity so that the anchor points form a standard discrete distribution on a unit sphere (i.e. the light distribution). Since the Laval Indoor dataset gardner2017 provides pixel-level depth annotations, the depth at each anchor point can be determined by averaging the depth values of all the pixels that are assigned to the anchor point (similarly based on the minimum radian distance).
3.2 Illumination Parameter Prediction
The structure of the regression network is shown in Fig. 2. Four branches are adopted to regress the four sets of parameters, respectively. For the light intensity and ambient term, a naive L2 loss can be adopted for the regression. However, for the light distribution and depth values, which are localized on a sphere, the naive L2 loss cannot effectively utilize the spatial information of the geometric distribution or the property of a standard distribution (for the light distribution, the values of all anchor points sum to one). We therefore exploit the spatial information of the geometric distribution and propose a novel geometric mover’s loss to regress the light distribution and depth values.
Geometric Mover’s Loss: Although a naive L2 loss or cross-entropy loss can be applied to regress the discrete distribution (namely the values of the N anchor points), these naive losses introduce several problems. First, the L2 loss regresses each anchor point separately and cannot effectively evaluate the discrepancy between two sets of anchor points (two distributions). Second, neither the L2 loss nor the cross-entropy loss can effectively utilize the spatial information of the light distribution and depth values in parameter regression.
Inspired by the earth mover’s distance, which measures the discrepancy between two measures, we propose a novel geometric mover’s loss (GML) to measure the discrepancy between two discrete geometric measures. To derive the proposed GML, we define two discrete distributions, each with N points on the sphere. Intuitively, GML can be treated as the minimum total cost required to transform one distribution into the other, where the cost is measured by multiplying the amount of ‘earth’ to be moved by the distance over which it is moved. A transportation plan (or moving plan) matrix of size N × N can then be defined, where each entry denotes the amount of ‘earth’ moved between a point of the first distribution and a point of the second. In addition, a cost matrix of size N × N is defined, where each entry gives the unit cost of moving between the corresponding pair of points.
In our previous work EMLight, the unit cost between two points is measured by their radian distance along the unit sphere, as shown in Fig. 4. However, this cost simplifies the lighting distribution to be spherical and ignores the complex geometry of real scenes. In contrast, we propose to model the lighting distribution with a geometric distance determined by the depth map. As shown in Fig. 4, the distance from a geometric anchor point to the spherical center is its depth value instead of the radius, which reflects the real geometry of the scene. The geometric distance (or unit cost) between two anchor points can be computed from their depth values and the spherical angle between them, as follows:
(1) 
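The distance described above is consistent with the law of cosines applied to the triangle formed by the spherical center and the two anchor points. The following is a minimal NumPy sketch under that reading; the function name `geometric_cost` and its argument names are hypothetical and are not taken from the GMLight implementation.

```python
import numpy as np

def geometric_cost(directions, depths):
    """Pairwise distances between anchor points placed at depth d_i along unit direction l_i.

    Assumed form (law of cosines): sqrt(d_i^2 + d_j^2 - 2 * d_i * d_j * cos(theta_ij)),
    where theta_ij is the angle between directions l_i and l_j.
    """
    directions = np.asarray(directions, dtype=float)
    depths = np.asarray(depths, dtype=float)
    # Cosine of the spherical angle between every pair of unit directions.
    cos_theta = np.clip(directions @ directions.T, -1.0, 1.0)
    d_i = depths[:, None]
    d_j = depths[None, :]
    squared = d_i ** 2 + d_j ** 2 - 2.0 * d_i * d_j * cos_theta
    # Guard against tiny negative values caused by rounding.
    return np.sqrt(np.maximum(squared, 0.0))

# Two anchors at a right angle, at depths 3 and 4: the distance between them is 5.
cost = geometric_cost([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [3.0, 4.0])
```

With all depths set to one, this sketch returns the chord length between unit directions rather than the arc length used as the radian distance in EMLight, so the two costs are related but not identical.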
With the transportation plan matrix and cost matrix defined above, GML can be formulated as the minimum total cost of the transport between the two distributions:
(2) 
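For reference, the standard balanced transport problem that matches this description can be written as below. The symbols are assumptions introduced here for illustration: u and v denote the two distributions over the N anchor points, T the transportation plan, C the cost matrix, and 1 the all-ones vector.

```latex
\mathrm{GML}(u, v) \;=\; \min_{T \ge 0} \; \sum_{i=1}^{N} \sum_{j=1}^{N} C_{ij}\, T_{ij}
\quad \text{s.t.} \quad T\,\mathbf{1} = u, \qquad T^{\top}\mathbf{1} = v .
```

The two constraints state that all mass leaving each point of u is accounted for and that each point of v receives exactly its target mass, which is the ‘hard’ conservation of mass that the unbalanced variant relaxes.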
Unlike light distributions, the spherical depth values do not form a standard distribution (the sum of all depth values is not constant). We therefore introduce an unbalanced setting of GML for the regression of depth values, based on a relaxed version of the classical earth mover’s distance, namely the unbalanced earth mover’s distance chizat2016uot . It determines an optimal transport plan between measures (here, spherical depth values) of different total masses. We formulate the unbalanced GML by replacing the ‘hard’ conservation of mass in (2) with a ‘soft’ penalty based on a divergence metric. The unbalanced GML can thus be formulated as follows:
(3) 
where the penalty weight is a regularization parameter and the divergence is the Kullback-Leibler divergence, which is defined as:
(4) 
To solve the problem in an efficient and differentiable way, Cuturi et al. cuturi2013sinkhorn introduced an entropic regularization term on the transportation plan. With this term, the original problem (2) can be formulated as follows:
(5) 
where the regularization coefficient controls the smoothness of the transportation plan matrix. In our model, this coefficient is empirically set to 0.0001. The unbalanced GML (3) can be regularized similarly by adding the same entropic term. The regularized forms of problems (2) and (3) are then solved in an efficient and differentiable way by Sinkhorn iteration cuturi2013sinkhorn during network training.
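The Sinkhorn iteration for the balanced, entropy-regularized problem can be sketched as follows. This is an illustrative NumPy version in the log domain, not the authors’ PyTorch implementation: the function names, the toy one-dimensional cost, and the value `eps=0.5` (chosen only so that the toy example converges quickly) are assumptions of this sketch, and the unbalanced variant is not covered.

```python
import numpy as np

def _logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def sinkhorn_plan(u, v, cost, eps=0.5, n_iters=500):
    """Entropy-regularized transport plan between histograms u and v.

    Alternately rescales rows and columns (in the log domain) so that the plan
    exp((f_i + g_j - C_ij) / eps) matches the marginals u and v.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cost = np.asarray(cost, dtype=float)
    log_u, log_v = np.log(u), np.log(v)
    f = np.zeros_like(u)  # dual potential attached to the rows
    g = np.zeros_like(v)  # dual potential attached to the columns
    for _ in range(n_iters):
        f = eps * (log_u - _logsumexp((g[None, :] - cost) / eps, axis=1))
        g = eps * (log_v - _logsumexp((f[:, None] - cost) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - cost) / eps)

# Toy example: three points on a line, cost = absolute distance.
points = np.array([0.0, 1.0, 2.0])
cost = np.abs(points[:, None] - points[None, :])
u = np.array([0.6, 0.3, 0.1])
v = np.array([0.1, 0.3, 0.6])
plan = sinkhorn_plan(u, v, cost)
transport_cost = float(np.sum(plan * cost))
```

Working in the log domain keeps the updates stable for small regularization coefficients such as the 0.0001 quoted above, although more iterations are then typically needed. Because every step is a smooth function of the inputs, the same recursion written with an automatic-differentiation library can be back-propagated through during training.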
Using GML to regress the geometric distribution has two clear advantages. First, it makes the regression sensitive to the global geometry, thus effectively penalizing predicted activations that are far from the ground-truth distribution. Second, GML is smooth in training, providing stable optimization, which is beneficial for the under-constrained problem of illumination prediction.
3.3 Illumination Generation
With the predicted parameters, including the light distribution, light intensity and ambient term, we can create a Gaussian map using a spherical Gaussian function gardner2019deeppara as follows:
(6) 
Here the left-hand side is the Gaussian map, N is the number of anchor points, and each anchor point carries an RGB value equal to the product of its light-distribution value and the light intensity. Each anchor point also has a direction (predefined by Vogel’s method vogel1979 ), which is compared with a unit vector giving a direction on the sphere, and the angular size of the Gaussian is empirically set to 0.0025.
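As a rough illustration of how such a Gaussian map can be built, the sketch below places anchors with a golden-angle (Fibonacci) spiral, which is assumed here to be the sphere variant of Vogel’s method, and evaluates a spherical Gaussian of the assumed form value × exp((l · u − 1) / s) at every pixel of an equirectangular panorama. All function names are hypothetical, the intensity is a single scalar rather than RGB for brevity, and adding the ambient term as a constant offset is also an assumption.

```python
import numpy as np

def vogel_anchors(n=128):
    """Roughly uniform unit vectors on the sphere from a golden-angle spiral."""
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))
    k = np.arange(n)
    z = 1.0 - (2.0 * k + 1.0) / n      # evenly spaced heights in (-1, 1)
    r = np.sqrt(1.0 - z * z)           # radius of the circle at height z
    phi = golden_angle * k
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def panorama_directions(height, width):
    """Unit direction of each pixel centre of an equirectangular panorama."""
    theta = (np.arange(height) + 0.5) / height * np.pi               # polar angle
    phi = (np.arange(width) + 0.5) / width * 2.0 * np.pi - np.pi     # azimuth
    theta, phi = np.meshgrid(theta, phi, indexing="ij")
    return np.stack([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)], axis=-1)

def gaussian_map(distribution, intensity, anchors, height=64, width=128,
                 s=0.0025, ambient=0.0):
    """Sum of spherical Gaussians centred on the anchors (assumed functional form)."""
    dirs = panorama_directions(height, width)     # (H, W, 3)
    cos = dirs @ anchors.T                        # (H, W, N) cosine to each anchor
    weights = np.exp((cos - 1.0) / s)             # 1 at the anchor, decaying away from it
    values = np.asarray(distribution, dtype=float) * intensity
    return weights @ values + ambient             # (H, W)

anchors = vogel_anchors(128)
one_hot = np.zeros(128)
one_hot[10] = 1.0                                  # all light mass on anchor 10
gmap = gaussian_map(one_hot, intensity=5.0, anchors=anchors)
```

A larger `s` widens each lobe, which is one simple way to obtain the coarser guidance maps used at early generation stages.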
However, the constructed Gaussian map tends to lose realistic illumination frequency information, especially at high frequencies, which leads to weak shadow and shading effects as shown in Garon et al. garon2019fast . In contrast, an adversarial loss encourages the synthesis of high-frequency details, as validated in sajjadi2017enhancenet . Thus, we propose a generative projector to synthesize a realistic illumination map through adversarial training. Specifically, we formulate the synthesis of the illumination map as a conditional image generation task with paired data as illustrated in Fig. 5, where the Gaussian maps serve as the conditional input.
The overall architecture of the generation network is similar to SPADE park2019spade , as shown in Fig. 5. Instead of sampling a random vector, we encode the input image as a latent feature vector for the adversarial generation. The illumination map is a panoramic image, where pixels at different latitudes are stretched at different scales. As a result, vanilla convolution suffers from heavy distortions at different latitudes, especially around the polar regions of the panoramic image. To address this, SphereNet spherenet explicitly encodes invariance against latitude distortions into convolutional neural networks by adapting the sampling locations of the convolutional filters, an operation known as spherical convolution. We therefore adopt the spherical convolution (Spherical Conv) to generate the panoramic illumination map, effectively reversing distortions and wrapping the filters around the sphere.
In EMLight, the same conditional Gaussian map is injected into the generation process at different stages. Ideally, the conditional input should provide coarse-to-fine guidance for low-to-high generation stages. Thus, we adopt different radii in the Gaussian function to generate coarse-to-fine Gaussian maps for different generation stages, as shown in Fig. 5. The conditional Gaussian maps are injected into the generation process at multiple stages through spatially-adaptive normalization park2019spade (denoted by ‘F’), as illustrated in Fig. 5. More details on the proposed generative projector are provided in the supplementary file.
Spatially-Varying Projection: Spatially-varying illumination prediction aims to recover the illumination at different positions in a scene from a single image. As there are no annotations for spatially-varying illumination in the Laval Indoor dataset, we are unable to train a spatially-varying model directly. Previous research gardner2019deeppara proposed to incorporate depth values into the projection to approximate the effect of spatially-varying illumination. We follow a similar idea to gardner2019deeppara to estimate spatially-varying illumination, as described below.
The Gaussian map is constructed through a spherical Gaussian function, as shown in Eq. (6). When the insertion position is moved by a given offset, the direction of each anchor point changes accordingly. The depths of an anchor point as seen from the original and the new insertion positions can be obtained from the predicted depth values of the anchor points. The light intensity at the new insertion position can thus be approximated from these two depths, and the Gaussian map of the new insertion position can be constructed as follows:
(7) 
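One way to picture this re-projection is the sketch below. It rests on two assumptions that the text does not spell out: each anchor light is located at its depth along its direction, and its intensity falls off with the inverse square of the distance to the insertion position. The function `reproject_anchors` and its arguments are hypothetical names.

```python
import numpy as np

def reproject_anchors(directions, depths, intensities, offset):
    """Re-express anchor lights as seen from an insertion point shifted by `offset`.

    Assumptions of this sketch: lights sit at depth * direction, and intensity
    scales with (old distance / new distance) squared.
    """
    directions = np.asarray(directions, dtype=float)
    depths = np.asarray(depths, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    positions = directions * depths[:, None]            # 3D location of each light
    shifted = positions - np.asarray(offset, dtype=float)
    new_depths = np.linalg.norm(shifted, axis=1)        # distance from the new position
    new_directions = shifted / new_depths[:, None]      # unit direction from the new position
    new_intensities = intensities * (depths / new_depths) ** 2
    return new_directions, new_depths, new_intensities

# Moving one unit toward the first light halves its distance and quadruples its intensity.
new_dirs, new_deps, new_ints = reproject_anchors(
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [2.0, 2.0], [1.0, 1.0], [1.0, 0.0, 0.0])
```

The re-projected directions and intensities can then be passed to the same spherical Gaussian function to obtain the Gaussian map of the new insertion position.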
The Gaussian map is then fed into the generative projector to synthesize the final illumination map. Fig. 6 illustrates several samples of predicted Gaussian maps, generated illumination maps, visualized intensity maps and the corresponding ground truth. Fig. 10 illustrates the generated spatially-varying illumination maps at different insertion positions.
| Method | RMSE (D) | RMSE (S) | RMSE (M) | siRMSE (D) | siRMSE (S) | siRMSE (M) | Angular Error (D) | Angular Error (S) | Angular Error (M) | User Study (D) | User Study (S) | User Study (M) | GMD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gardner et al. gardner2017 | 0.146 | 0.173 | 0.202 | 0.142 | 0.151 | 0.174 | 8.12 | 8.37 | 8.81 | 28.0% | 23.0% | 20.5% | 6.842 |
| Gardner et al. gardner2019deeppara | 0.084 | 0.112 | 0.147 | 0.073 | 0.093 | 0.119 | 6.82 | 7.15 | 7.22 | 33.5% | 28.0% | 24.5% | 5.524 |
| Li et al. li2019spherical | 0.203 | 0.218 | 0.257 | 0.193 | 0.212 | 0.243 | 9.37 | 9.51 | 9.81 | 25.0% | 21.5% | 17.5% | 7.013 |
| Garon et al. garon2019fast | 0.181 | 0.207 | 0.249 | 0.177 | 0.196 | 0.221 | 9.12 | 9.32 | 9.49 | 27.0% | 22.5% | 19.0% | 7.137 |
| Zhan et al. zhan2021emlight | 0.062 | 0.071 | 0.089 | 0.043 | 0.054 | 0.078 | 6.43 | 6.61 | 6.95 | 40.0% | 35.0% | 25.0% | 5.131 |
| GMLight | 0.051 | 0.064 | 0.078 | 0.037 | 0.049 | 0.074 | 6.21 | 6.50 | 6.77 | 42.0% | 35.5% | 31.0% | 4.892 |

Table 1: Comparison of GMLight with several state-of-the-art lighting estimation methods. The evaluation metrics include the widely used RMSE, siRMSE, angular error, user study and GMD. D, S, and M denote diffuse, matte silver, and mirror materials of the rendered objects, respectively.
Loss Functions: The generative projector employs several losses to drive the generation of high-quality illumination maps. These losses are defined in terms of the input Gaussian map, the ground-truth illumination map, and the generated illumination map. To stabilize the training, we introduce a feature matching loss that matches the intermediate features of the discriminator between the generated illumination map and the ground truth:
(8) 
where the features are the activations of each discriminator layer and each layer is weighted by a balancing coefficient. To obtain a similar illumination distribution instead of excessively emphasizing the absolute intensity, a cosine similarity is computed between the generated illumination map and the ground truth as follows:
(9) 
where a scalar weight balances this term. The discriminator adopts the same architecture as PatchGAN isola2017pixel2pixel and provides the adversarial loss. The generative projector is then optimized under the following objective:
(10) 
As the regression network and generative projector are both differentiable, the whole framework can be optimized end-to-end.
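The two auxiliary terms described above can be sketched in NumPy as follows. The exact forms are assumptions: the feature matching term is taken to be a weighted mean absolute difference between discriminator activations, and the cosine term is taken to be one minus the cosine similarity of the flattened maps. The function names and the list-of-arrays interface are hypothetical.

```python
import numpy as np

def feature_matching_loss(fake_feats, real_feats, weights):
    """Weighted sum over layers of the mean absolute feature difference (assumed form)."""
    return float(sum(w * np.mean(np.abs(np.asarray(f, dtype=float) - np.asarray(r, dtype=float)))
                     for w, f, r in zip(weights, fake_feats, real_feats)))

def cosine_loss(generated, target, weight=1.0, eps=1e-8):
    """weight * (1 - cosine similarity) between flattened illumination maps (assumed form)."""
    g = np.ravel(np.asarray(generated, dtype=float))
    t = np.ravel(np.asarray(target, dtype=float))
    cos = np.dot(g, t) / (np.linalg.norm(g) * np.linalg.norm(t) + eps)
    return float(weight * (1.0 - cos))

target = np.arange(1.0, 7.0).reshape(2, 3)
scaled = 3.0 * target                        # same distribution, different absolute intensity
loss_cos = cosine_loss(scaled, target)       # close to zero: scale is ignored
loss_fm = feature_matching_loss([np.ones(4)], [np.zeros(4)], [2.0])
```

The example shows why the cosine term suits the stated goal: a map that differs from the ground truth only by a global intensity scale incurs almost no penalty.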
4 Experiments
4.1 Datasets and Experimental Settings
We evaluate GMLight on the Laval Indoor HDR dataset gardner2017 that consists of 2,100 HDR panoramas taken in a variety of indoor environments. Similar to gardner2017 , we crop eight images with limited fields of view from each panorama, which produces 19,556 training pairs for our experiments. We apply the same image warping as in gardner2017 to each image. In our experiments, we randomly select 200 images as the testing set, and use the rest for training. In addition to the Laval Indoor HDR dataset, we also qualitatively evaluate GMLight on the dataset introduced in garon2019fast (https://lvsn.github.io/fastindoorlight/).
Following gardner2019deeppara and garon2019fast , we use DenseNet121 huang2017densely as the backbone of the regression network. The detailed network structure of the generative projector is provided in the supplementary file. We implemented GMLight in PyTorch and adopted the Adam algorithm kingma2014adam as the optimizer, with a learning rate decay mechanism (the initial learning rate is 0.001). The network is trained on two NVIDIA Tesla P100 GPUs with a batch size of 4 for 100 epochs.
4.2 Evaluation Method and Metrics
Similar to the evaluation settings in Legendre et al. legendre2019deeplight , our scene for model evaluation includes three spheres made of gray diffuse, matte silver and mirror materials, as illustrated in Fig. 7. The performance is evaluated by comparing the scene images rendered (by Blender blender ) with the predicted illumination against those rendered with the ground-truth illumination. The evaluation metrics include root mean square error (RMSE) and scale-invariant RMSE (siRMSE), which focus on the estimated light intensity and light directions (or shadings), respectively. Both metrics have been widely adopted in the evaluation of illumination prediction. We also adopt the per-pixel linear RGB angular error legendre2019deeplight and a crowdsourced user study on Amazon Mechanical Turk (AMT) to subjectively assess the realism of the rendered images. In the experiments, each compared model predicts 200 illumination maps on the test set for quantitative evaluation. For qualitative evaluation, we design 25 different scenes for 3D insertion and render them with the predicted illumination maps.
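The three image-space metrics can be sketched as follows. The definitions used here are common ones and are assumptions as far as this paper is concerned: siRMSE is computed after rescaling the prediction by the least-squares optimal scalar, and the angular error is the mean per-pixel angle between RGB vectors. The function names are hypothetical.

```python
import numpy as np

def rmse(pred, gt):
    """Root mean square error between two renderings."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def si_rmse(pred, gt):
    """RMSE after scaling pred by the scalar that minimizes the squared error."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    alpha = np.sum(pred * gt) / np.sum(pred * pred)
    return rmse(alpha * pred, gt)

def angular_error_deg(pred, gt, eps=1e-8):
    """Mean per-pixel angle in degrees between RGB vectors (last axis is RGB)."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    dot = np.sum(pred * gt, axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(gt, axis=-1)
    cos = np.clip(dot / np.maximum(norms, eps), -1.0, 1.0)
    return float(np.degrees(np.mean(np.arccos(cos))))

gt = np.array([[[0.2, 0.4, 0.6], [0.5, 0.5, 0.5]]])   # a 1 x 2 RGB image
pred = 2.0 * gt                                        # correct colors, wrong global scale
err_rmse = rmse(pred, gt)
err_si = si_rmse(pred, gt)
err_ang = angular_error_deg(pred, gt)
```

In this example the prediction is off only by a global scale, so RMSE is positive while siRMSE and the angular error are essentially zero, which illustrates why RMSE is read as an intensity measure and the other two as measures of light direction and color.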
Besides, we introduce a Geometric Mover’s distance (GMD) based on the geometric mover’s loss as described in (2) to measure the discrepancy between light distributions of illumination maps as below:
(11) 
where the two inputs are the normalized illumination maps, whose pixels form geometric distributions. Through its cost matrix, the GMD metric is sensitive to the scene geometry, and thus evaluates the illumination distribution (or directions) more accurately than siRMSE.
4.3 Quantitative Evaluation
We compare GMLight with EMLight zhan2021emlight and several other state-of-the-art methods that either generate illumination maps directly gardner2017 or estimate representative illumination parameters garon2019fast ; li2019spherical ; gardner2019deeppara . For each compared method, we render 200 images of the testing scene (three spheres made of diffuse, matte silver, and mirror silver materials) using the illumination predicted on the test set of the Laval Indoor dataset. Table 1 shows the experimental results, where D, S and M denote diffuse, matte silver and mirror material objects, respectively. The AMT user study was conducted by showing 20 users pairs of images, one rendered with the ground truth and one with a compared method, and asking them to pick the more realistic image. The score is the percentage of rendered images that are deemed more realistic than the ground-truth rendering.
As can be observed, GMLight consistently outperforms all compared methods under different evaluation metrics and materials. EMLight zhan2021emlight simplifies the lighting distribution of scenes to be spherical, ignoring the complex scene geometry. GMLight introduces depth to model scene geometry, which leads to more accurate illumination estimation. gardner2017 generates illumination maps directly, but it tends to overfit the training data due to the under-constrained nature of illumination estimation from a single image. gardner2019deeppara regresses the spherical Gaussian parameters of light sources, but it often loses useful frequency information and generates inaccurate shading and shadows. li2019spherical adopts spherical Gaussian functions to reconstruct the illumination maps in the spatial domain, but it often loses high-frequency illumination. garon2019fast recovers lighting by regressing spherical harmonic coefficients, but their model struggles to regress lighting directions and recover high-frequency information. Although garon2019fast adopts a masked L2 loss to preserve high-frequency information, it does not fully solve the problem, as illustrated in Fig. 8. In contrast, GMLight estimates illumination parameters by regressing the light distribution under a geometric mover’s loss. With the estimated parameters, the generative projector generates accurate and high-fidelity illumination maps with realistic frequency information via adversarial training.
| Model | RMSE (D) | RMSE (S) | RMSE (M) | siRMSE (D) | siRMSE (S) | siRMSE (M) | Angular Error (D) | Angular Error (S) | Angular Error (M) | User Study (D) | User Study (S) | User Study (M) | GMD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SG+L2 | 0.204 | 0.213 | 0.238 | 0.188 | 0.203 | 0.229 | 9.18 | 9.42 | 9.73 | 26.0% | 22.5% | 18.0% | 5.631 |
| GD+L2 | 0.133 | 0.161 | 0.178 | 0.117 | 0.132 | 0.161 | 7.60 | 7.88 | 8.12 | 30.5% | 25.5% | 22.0% | 5.303 |
| GD+SML | 0.080 | 0.103 | 0.117 | 0.072 | 0.087 | 0.106 | 6.78 | 6.98 | 7.12 | 34.0% | 31.5% | 26.0% | 5.163 |
| GD+GML | 0.073 | 0.091 | 0.102 | 0.062 | 0.069 | 0.092 | 6.61 | 6.85 | 7.04 | 35.5% | 32.0% | 25.5% | 5.031 |
| GD+GML+GP | 0.051 | 0.064 | 0.078 | 0.037 | 0.049 | 0.074 | 6.21 | 6.50 | 6.77 | 42.0% | 35.5% | 31.0% | 4.892 |

Table 2: Ablation study of the proposed GMLight. SG and GD denote spherical Gaussian representation and our proposed geometric distribution representation of illumination maps. L2 and GML denote L2 loss and geometric mover’s loss that are used in the regression of lighting parameters. GP denotes our proposed generative projector.
4.4 Qualitative Evaluation
We visualize our predicted Gaussian maps, generated illumination maps, and the corresponding intensity maps in Fig. 6. As can be seen, our regression network accurately predicts light distributions as shown in the Gaussian Map. The generative projector generates accurate and realistic HDR illumination maps as shown in the Generation. To further verify the quality of generated HDR illumination, we visualize the intensity maps of the illumination maps.
We qualitatively compare GMLight with four state-of-the-art light estimation methods. Fig. 8 shows the images rendered with the predicted illumination maps (highlighted by red boxes). As can be observed, GMLight predicts realistic illumination maps with plausible light sources and produces realistic renderings with clear and accurate shades and shadows. In contrast, direct generation gardner2017 struggles to identify the direction of light sources as there is no guidance for the generation. Illumination maps produced by Gardner et al. gardner2019deeppara are oversimplified with a limited number of light sources. This simplification loses accurate frequency information, which results in unrealistic shadow and shading in the renderings. Garon et al. garon2019fast and Li et al. li2019spherical regress illumination parameters but are often constrained by the order of the representative functions (spherical harmonic and spherical Gaussian). As a result, they predict low-frequency illumination and produce renderings with inaccurate shades and shadows, as illustrated in Fig. 8.
We provide more object relighting examples using images from the Internet in Fig. 9. Our model predicts illumination maps from background images reliably, and these maps can be used to render realistic objects with an off-the-shelf renderer (e.g. Blender).
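| Method | RMSE (D) | RMSE (S) | RMSE (M) | siRMSE (D) | siRMSE (S) | siRMSE (M) |
|---|---|---|---|---|---|---|
| Anchor=64 | 0.071 | 0.085 | 0.102 | 0.064 | 0.071 | 0.093 |
| Anchor=196 | 0.053 | 0.062 | 0.081 | 0.036 | 0.050 | 0.075 |
| W/o GML | 0.062 | 0.074 | 0.094 | 0.055 | 0.059 | 0.078 |
| W/o SConv | 0.056 | 0.069 | 0.082 | 0.044 | 0.054 | 0.083 |
| W/o Adp Radius | 0.063 | 0.071 | 0.085 | 0.047 | 0.056 | 0.081 |
| GMLight | 0.051 | 0.064 | 0.078 | 0.037 | 0.049 | 0.074 |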
4.5 Ablation Study
We developed several GMLight variants as listed in Table 2 to evaluate the effectiveness of our proposed method. The variants include 1) SG+L2 (baseline) that regresses spherical Gaussian parameters with L2 loss as in gardner2017 ; 2) GD+L2 that regresses geometric distribution of illumination with L2 loss; 3) GD+SML that regresses geometric distribution of illumination with SML proposed in EMLight; 4) GD+GML that regresses geometric distribution of illumination with GML; and 5) GD+GML+GP (standard GMLight). Similar to the setting in Quantitative Evaluation, we apply all variant models to render 200 images of the testing scene. As Table 2 shows, GD+L2 outperforms SG+L2 clearly, demonstrating the superiority of geometric distributions in representing illuminations. GD+GML also produces better estimation than GD+L2 and GD+SML, validating the effectiveness of our designed GML. GD+GML+GP performs the best, demonstrating that the generative projector improves illumination predictions clearly.
We also benchmark the geometric mover’s loss (GML) against the widely adopted cross-entropy loss for distribution regression, compare spherical convolution with vanilla convolution, and study how the number of anchor points affects lighting estimation. We follow the experimental setting of Table 2, and Table 3 shows the experimental results in RMSE and siRMSE on three materials. We can see that GML clearly outperforms the cross-entropy loss, as GML captures the spatial information of geometric distributions effectively. In addition, spherical convolution consistently performs better than vanilla convolution in panoramic image generation. Further, the prediction performance drops slightly when 64 instead of 128 anchor points are used, and increasing the number of anchor points to 196 does not noticeably improve the performance. We conjecture that the larger number of parameters with 196 anchor points affects the regression accuracy negatively.
| Method | RMSE (Left) | RMSE (Center) | RMSE (Right) | siRMSE (Left) | siRMSE (Center) | siRMSE (Right) |
|---|---|---|---|---|---|---|
| Gardner et al. gardner2017 | 0.168 | 0.176 | 0.171 | 0.148 | 0.159 | 0.152 |
| Gardner et al. gardner2019deeppara | 0.102 | 0.114 | 0.104 | 0.085 | 0.097 | 0.087 |
| Garon et al. garon2019fast | 0.186 | 0.199 | 0.182 | 0.174 | 0.184 | 0.173 |
| GMLight | 0.059 | 0.066 | 0.058 | 0.043 | 0.051 | 0.041 |
4.6 Spatially-varying Illumination
Spatially-varying illumination prediction aims to recover the illumination at different positions in a scene. Fig. 10 shows the spatially-varying illumination maps predicted by GMLight at different insertion positions (center, left, right, up, and down). GMLight estimates plausible illumination maps at every insertion position, largely because the auxiliary depth branch estimates scene depths and thereby recovers the scene geometry. We also evaluate spatially-varying illumination estimation quantitatively, with results reported in Table 4. GMLight consistently outperforms the other methods at all insertion positions, which we attribute largely to the accurate geometric modeling of lighting distributions with scene depths.
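The role of depth in spatially-varying estimation can be illustrated with a small geometric sketch. Once each light sample has a direction and a depth, it has a 3D position, and that position can be re-expressed as seen from any insertion point. The code below is an assumption-laden illustration of this idea, not the procedure used by GMLight: the function `recenter`, the coordinate convention, and the sample values are all hypothetical.

```python
import math

def recenter(directions, depths, new_origin):
    """Re-express light samples as seen from new_origin.

    directions: unit vectors from the camera towards each light sample.
    depths: distance of each sample along its direction.
    Returns the unit directions and distances measured from new_origin.
    """
    out_dirs, out_dists = [], []
    for (dx, dy, dz), r in zip(directions, depths):
        # 3D position of the light sample in camera coordinates.
        px, py, pz = dx * r, dy * r, dz * r
        # Vector from the insertion point to the light sample.
        vx = px - new_origin[0]
        vy = py - new_origin[1]
        vz = pz - new_origin[2]
        dist = math.sqrt(vx * vx + vy * vy + vz * vz)
        if dist == 0.0:
            raise ValueError("insertion point coincides with a light sample")
        out_dirs.append((vx / dist, vy / dist, vz / dist))
        out_dists.append(dist)
    return out_dirs, out_dists

# Two hypothetical light samples: one 2 units along +x, one 1 unit along +y.
directions = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
depths = [2.0, 1.0]

# Move the insertion point 1 unit along +x.
new_dirs, new_dists = recenter(directions, depths, (1.0, 0.0, 0.0))
print(new_dirs)
print(new_dists)
```

With a purely spherical representation (no depth), every insertion point would see the same directions. With depths, the first sample becomes closer and the second shifts to a different direction, which is why scene geometry matters for position-dependent lighting.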
Fig. 11 illustrates 3D insertion results with the estimated spatially-varying illumination. The sample images are from garon2019fast , where a silver sphere indicates the spatially-varying illumination at different scene positions and serves as a reference for evaluating the realism of 3D insertion. As Fig. 11 shows, the inserted objects (clocks) at different positions exhibit shading and shadow effects consistent with those of the silver sphere, demonstrating the high-quality estimation of spatially-varying illumination by GMLight.
5 Conclusion
This paper presents GMLight, a lighting estimation framework that consists of a regression network and a generative projector. In GMLight, we formulate illumination prediction as a discrete distribution regression problem within a geometric space and design a geometric mover’s loss to regress the geometric light distribution effectively. To generate accurate illumination maps with realistic frequency content (especially high frequencies), we introduce a novel generative projector with progressive guidance that synthesizes panoramic illumination maps through adversarial training. Quantitative and qualitative experiments show that GMLight predicts illumination accurately from a single indoor image. In future work, we will continue to investigate illumination estimation from the perspective of geometric distributions.
References
 (1) Barron, J.T., Malik, J.: Intrinsic scene properties from a single rgbd image. TPAMI (2015)
 (2) Boss, M., Jampani, V., Kim, K., Lensch, H.P., Kautz, J.: Two-shot spatially-varying brdf and shape estimation. In: CVPR (2020)
 (3) Chen, Z., Chen, A., Zhang, G., Wang, C., Ji, Y., Kutulakos, K.N., Yu, J.: A neural rendering framework for free-viewpoint relighting. arXiv:1911.11530 (2019)
 (4) Cheng, D., Shi, J., Chen, Y., Deng, X., Zhang, X.: Learning scene illumination by pairwise photos from rear and front mobile cameras. Computer Graphics Forum (2018)
 (5) Chizat, L., Peyré, G., Schmitzer, B., Vialard, F.X.: Scaling algorithms for unbalanced transport problems. arXiv:1607.05816 (2016)
 (6) Coors, B., Condurache, A.P., Geiger, A.: Spherenet: Learning spherical representations for detection and classification in omnidirectional images. In: ECCV (2018)
 (7) Cuturi, M.: Sinkhorn distances: Lightspeed computation of optimal transport. In: NIPS (2013)
 (8) Gardner, M.A., Hold-Geoffroy, Y., Sunkavalli, K., Gagné, C., Lalonde, J.F.: Deep parametric indoor lighting estimation. In: ICCV (2019)
 (9) Gardner, M.A., Sunkavalli, K., Yumer, E., Shen, X., Gambaretto, E., Gagné, C., Lalonde, J.F.: Learning to predict indoor illumination from a single image. In: SIGGRAPH Asia (2017)
 (10) Garon, M., Sunkavalli, K., Hadap, S., Carr, N., Lalonde, J.F.: Fast spatially-varying indoor lighting estimation. In: CVPR (2019)
 (11) Gupta, M., Tian, Y., Narasimhan, S.G., Zhang, L.: A combined theory of defocused illumination and global light transport. International Journal of Computer Vision 98(2), 146–167 (2012)
 (12) Hess, R.: Blender Foundations: The Essential Guide to Learning Blender 2.6. Focal Press (2010)
 (13) Hold-Geoffroy, Y., Sunkavalli, K., Hadap, S., Gambaretto, E., Lalonde, J.F.: Deep outdoor illumination estimation. In: CVPR (2017)
 (14) Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708 (2017)
 (15) Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR (2017)
 (16) Karsch, K., Hedau, V., Forsyth, D., Hoiem, D.: Rendering synthetic objects into legacy photographs. TOG (2011)
 (17) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
 (18) Lalonde, J.F., Efros, A.A., Narasimhan, S.G.: Estimating the natural illumination conditions from a single outdoor image. IJCV (2012)
 (19) LeGendre, C., Ma, W.C., Fyffe, G., Flynn, J., Charbonnel, L., Busch, J., Debevec, P.: Deeplight: Learning illumination for unconstrained mobile mixed reality. In: CVPR (2019)
 (20) Li, M., Guo, J., Cui, X., Pan, R., Guo, Y., Wang, C., Yu, P., Pan, F.: Deep spherical gaussian illumination estimation for indoor scene. In: MM Asia (2019)
 (21) Li, Z., Shafiei, M., Ramamoorthi, R., Sunkavalli, K., Chandraker, M.: Inverse rendering for complex indoor scenes shape, spatially varying lighting and svbrdf from a single image. In: CVPR (2020)
 (22) Liao, Z., Karsch, K., Zhang, H., Forsyth, D.: An approximate shading model with detail decomposition for object relighting. International Journal of Computer Vision 127(1), 22–37 (2019)
 (23) Liu, D., Long, C., Zhang, H., Yu, H., Dong, X., Xiao, C.: Arshadowgan: Shadow generative adversarial network for augmented reality in single light scenes. In: CVPR (2020)
 (24) Lombardi, S., Nishino, K.: Reflectance and illumination recovery in the wild. TPAMI (2016)
 (25) Maier, R., Kim, K., Cremers, D., Kautz, J., Nießner, M.: Intrinsic3d: High-quality 3d reconstruction by joint appearance and geometry optimization with spatially-varying lighting. In: ICCV (2017)
 (26) Maurer, D., Ju, Y.C., Breuß, M., Bruhn, A.: Combining shape from shading and stereo: A joint variational method for estimating depth, illumination and albedo. International Journal of Computer Vision 126(12), 1342–1366 (2018)
 (27) Murmann, L., Gharbi, M., Aittala, M., Durand, F.: A dataset of multi-illumination images in the wild. In: ICCV (2019)
 (28) Ngo, T.T., Nagahara, H., Nishino, K., Taniguchi, R.i., Yagi, Y.: Reflectance and shape estimation with a light field camera under natural illumination. International Journal of Computer Vision 127(11), 1707–1722 (2019)
 (29) Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with spatially-adaptive normalization. In: CVPR (2019)
 (30) Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover’s distance as a metric for image retrieval. IJCV (2000)
 (31) Sajjadi, M.S., Scholkopf, B., Hirsch, M.: Enhancenet: Single image super-resolution through automated texture synthesis. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4491–4500 (2017)
 (32) Sengupta, S., Gu, J., Kim, K., Liu, G., Jacobs, D.W., Kautz, J.: Neural inverse rendering of an indoor scene from a single image. In: ICCV (2019)
 (33) Song, S., Funkhouser, T.: Neural illumination: Lighting prediction for indoor environments. In: CVPR (2019)
 (34) Srinivasan, P.P., Mildenhall, B., Tancik, M., Barron, J.T., Tucker, R., Snavely, N.: Lighthouse: Predicting lighting volumes for spatially-coherent illumination. In: CVPR (2020)
 (35) Sun, T., Barron, J.T., Tsai, Y.T., Xu, Z., Yu, X., Fyffe, G., Rhemann, C., Busch, J., Debevec, P., Ramamoorthi, R.: Single image portrait relighting. TOG (2019)
 (36) Vogel, H.: A better way to construct the sunflower head. Mathematical biosciences (1979)
 (37) Xue, C., Lu, S., Zhan, F.: Accurate scene text detection through border semantics awareness and bootstrapping. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 355–372 (2018)
 (38) Zhan, F., Huang, J., Lu, S.: Adaptive composition gan towards realistic image synthesis. arXiv preprint arXiv:1905.04693 (2019)
 (39) Zhan, F., Lu, S.: Esir: End-to-end scene text recognition via iterative image rectification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2059–2068 (2019)
 (40) Zhan, F., Lu, S., Xue, C.: Verisimilar image synthesis for accurate detection and recognition of texts in scenes. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 249–266 (2018)
 (41) Zhan, F., Lu, S., Zhang, C., Ma, F., Xie, X.: Adversarial image composition with auxiliary illumination. In: Proceedings of the Asian Conference on Computer Vision (2020)
 (42) Zhan, F., Lu, S., Zhang, C., Ma, F., Xie, X.: Towards realistic 3d embedding via view alignment. arXiv preprint arXiv:2007.07066 (2020)
 (43) Zhan, F., Xue, C., Lu, S.: Gadan: Geometry-aware domain adaptation network for scene text detection and recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 9105–9115 (2019)
 (44) Zhan, F., Zhang, C.: Spatial-aware gan for unsupervised person re-identification. Proceedings of the International Conference on Pattern Recognition (2020)
 (45) Zhan, F., Zhang, C., Yu, Y., Chang, Y., Lu, S., Ma, F., Xie, X.: Emlight: Lighting estimation via spherical distribution approximation. arXiv preprint arXiv:2012.11116 (2020)
 (46) Zhan, F., Zhu, H., Lu, S.: Scene text synthesis for efficient and effective deep network training. arXiv preprint arXiv:1901.09193 (2019)
 (47) Zhan, F., Zhu, H., Lu, S.: Spatial fusion gan for image synthesis. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3653–3662 (2019)
 (48) Zhang, E., Cohen, M.F., Curless, B.: Emptying, refurnishing, and relighting indoor spaces. TOG (2016)
 (49) Zoran, D., Krishnan, D., Bento, J., Freeman, B.: Shape and illumination from shading using the generic viewpoint assumption. In: NIPS (2014)