1 Introduction
A longstanding problem in computer vision is to reconstruct a scene, including its shape, lighting, and material properties, from a single image. This is an ill-posed task: these scene factors interact in complex ways to form images, and multiple combinations of these factors may produce the same image
[3]. As a result, previous work has often focused on subsets of this problem (shape reconstruction, illumination estimation, intrinsic images, etc.) or on restricted settings (single objects, or objects from a specific class). Our goal is a solution to a more general problem: from a single RGB image of an arbitrary indoor scene captured under uncontrolled conditions, we seek to reconstruct geometry, spatially-varying surface reflectance, and spatially-varying lighting. This is a challenging setting: indoor scenes demonstrate the entire range of real-world appearance, including arbitrary geometry and layouts, localized light sources that lead to complex spatially-varying lighting effects, and complex, non-Lambertian surface reflectance. In this work we take a step towards a completely automatic, robust, holistic solution to this problem, thereby enabling a range of scene understanding and editing tasks. For example, in Figure
1(h), we use our scene reconstruction to enable photorealistic virtual object insertion; note how the inserted glossy spheres have realistic shading, shadowing caused by scene occlusions, and even reflections from the scene. Driven by the success of deep learning methods on similar scene inference tasks (geometric reconstruction
[19], lighting estimation [20], material recognition [11]), we propose training a deep convolutional neural network to regress these scene parameters from an input image. Ideally, the trained network should learn meaningful priors on these scene factors, and jointly model the interactions between them. In this work, we present two major contributions to address this.
Training deep neural networks requires large-scale, labeled training data. While datasets of real-world geometry exist [17, 13], capturing real-world lighting and surface reflectance at scale is nontrivial. Therefore, we use the SUNCG synthetic indoor scene dataset [53], which contains a large, diverse set of indoor scenes with complex geometry. However, the materials used in SUNCG are not realistic and the rendered images [60] are noisy. We address this by replacing SUNCG materials with high-quality, photorealistic SVBRDFs from a 3D material dataset [1]
. We automatically map these SVBRDFs to SUNCG materials using deep features from a material estimation network, thus preserving scene semantics. We render the new scenes using a GPU-based global illumination renderer to create high-quality input images. We also render the new scene reflectance and lighting and use them (along with the original geometry) to supervise our inverse rendering network.
An inverse rendering network has to learn a model of image formation. The forward image formation model is well understood and has been used in simple settings like planar scenes and single objects [18, 38, 37, 40]. Indoor scenes are more complicated and exhibit challenging light transport effects like occlusions and interreflections. We address this by using a local lighting model: spatially-varying spherical Gaussians (SVSGs). This bakes light transport effects directly into the lighting and makes rendering a purely local computation. We leverage this to design a fast, differentiable, in-network rendering layer that takes our geometry, SVBRDFs and SVSGs and computes radiance values. During training, we render our predictions and backpropagate the error through the rendering layer; this fixes the forward model, allowing the network to focus on the inverse task.
To the best of our knowledge, our work is the first demonstration of scene-level inverse rendering that truly accounts for complex geometry, lighting, materials, and light transport. Moreover, we demonstrate results on par with state-of-the-art methods focused on specific tasks; for example, the diffuse albedo reconstructed by our method is competitive with a state-of-the-art intrinsic image method. Most importantly, by truly decomposing a scene into physically-based scene factors, we enable novel capabilities like photorealistic 3D object insertion and scene editing in images acquired in the wild. Figure 2 shows two object insertion examples on real indoor scene images. Since our method solves the inverse rendering problem holistically, it achieves superior performance on object insertion compared with previous state-of-the-art methods [20, 6]. Figure 3 shows a material editing example, where we replace the material of a planar surface in a real image. Note that our method preserves spatially-varying specular highlights after changing the material; such visual effects cannot be handled by traditional intrinsic decomposition methods.
2 Related Work
The problem of reconstructing shape, reflectance, and illumination from images has a long history in vision. It has been studied in different forms, such as intrinsic images (reflectance and shading from an image) [8] and shape-from-shading (shape, and sometimes reflectance, from an image) [25]. Here, we focus on single image methods.
Single objects. Many inverse rendering methods focus on reconstructing single objects. Even this problem is ill-posed, and many methods assume some knowledge of the object in terms of known lighting [46, 27] or geometry [41, 49]. Other methods focus on specific object classes; for example, many methods reconstruct facial shape, reflectance, and illumination using low-dimensional face models [12]. Recent methods have leveraged deep networks to reconstruct complex SVBRDFs from single images (captured under unknown environments) of simpler planar scenes [18, 37], objects of a specific class [40], or homogeneous BRDFs [43]. Other methods address illumination estimation [21]. We tackle the much harder case of large-scale scene modeling and do not assume scene information.
Barron and Malik [5] propose an optimization-based approach with handcrafted priors to reconstruct shape, Lambertian reflectance, and distant illumination from a single image of an arbitrary object. Li et al. [38] tackle the same problem with a deep network and an object-specific rendering layer. Extending these methods to scenes is nontrivial because the light transport is significantly more complex.
Large-scale scenes. Previous work has looked at recognizing materials in indoor scenes [11] and decomposing indoor images into reflectance and shading layers [10, 36]. Techniques have also been proposed for single image geometric reconstruction [19] and lighting estimation [24, 20]. Unlike our approach, these methods estimate only one scene factor, without modeling the rest of scene appearance.
Barron and Malik [6] reconstruct Lambertian reflectance and spatially-varying lighting, but require an RGB-D input image. Karsch et al. [32] propose a full-fledged scene reconstruction method that estimates geometry, Lambertian reflectance, and 3D lighting from a single image; however, they rely on extensive user input to annotate geometry and initialize lighting. Subsequently, they propose an automatic, rendering-based optimization method [33] that estimates all these scene factors. However, their method relies on strong heuristics that are often violated in the real world, leading to errors in the estimates. In contrast, we propose a deep network that learns to predict geometry, complex SVBRDFs, and lighting in an end-to-end fashion.
Datasets. The success of deep networks has led to an interest in datasets for supervised training, including real-world scans [17, 13] and synthetic shape [14] and scene [53] datasets. All of these datasets either lack material and lighting specifications or use unrealistic ones. We build on the SUNCG dataset to improve its quality in this regard.
Differentiable rendering. A number of recent deep inverse rendering methods have incorporated in-network, differentiable rendering layers customized for simple settings: faces [51, 56], planar surfaces [18, 37], and single objects [40, 38]. Some recent work has proposed differentiable general-purpose global illumination renderers [35, 15]; unlike our more specialized, fast rendering layer, these are too expensive to use for neural network training.
3 Dataset for Complex Indoor Scenes
A large-scale dataset is crucial for solving the complex task of inverse rendering in indoor scenes. It is extremely difficult, if at all possible, to acquire large-scale ground truth with spatially-varying material, lighting and global illumination. Thus, we resort to rendering a synthetic dataset, but must overcome significant challenges to ensure utility for handling real indoor scenes at test time. Existing datasets for indoor scenes are rendered with simpler assumptions on material and lighting. In this section, we describe our approach to photorealistically map our microfacet materials to SUNCG geometries [54] while preserving semantics. Further, rendering images with SVBRDFs and global illumination, as well as ground truth for spatially-varying lighting, is computationally intensive, for which we design a custom GPU-accelerated renderer that outpaces Mitsuba on a modern 16-core CPU by an order of magnitude.
3.1 Mapping photorealistic materials to SUNCG
Our goal is to map our materials to SUNCG geometries in a semantically meaningful way. The original materials in the SUNCG dataset are represented by a Phong BRDF model [47], which is not suitable for complex materials [45]. Our materials, on the other hand, come from a dataset of 1332 materials represented by a physically motivated microfacet BRDF model [29], with high resolution 4096×4096 SVBRDF textures (please refer to Appendix D for details). This mapping problem is nontrivial: (i) specular lobes in SUNCG are not realistic [45, 55], (ii) optimization-based fitting collapses due to local minima, leading to serious overfitting when used for learning, and (iii) we must replace materials with similar semantic types while being consistent with geometry; for example, material on walls should be replaced with other paints, and on sofas with other fabrics. Thus, we devise a three-stage pipeline, summarized in Figure 4.
Step 1: Tileable texture synthesis
Directly replacing SUNCG textures with our non-tileable ones would create artifacts near boundaries. Most frameworks for tileable texture synthesis [39, 44] use randomized patch-based methods [4], which do not preserve structures such as sharp straight edges that are common in indoor scene materials such as bricks or wood floors. Instead, we first search for an optimal crop from our SVBRDF texture by minimizing the gradients of diffuse albedo, normals and roughness perpendicular to the patch boundaries. We then find the best seam for tiling along the horizontal and vertical directions by modifying the graph cut method of [34] to encourage gradients to be similar at seams. Please refer to Appendix E for details on the energy design and examples of our texture synthesis.
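As an illustration of the crop-selection step, the sketch below scores candidate crops by how well their opposite borders match, a simplified stand-in for the boundary-gradient energy described above (which is evaluated over albedo, normals and roughness); the function names, the single-channel search, and the stride parameter are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def boundary_gradient_cost(tex, size, y, x):
    """Sum of squared differences between opposite borders of a
    size x size crop at (y, x); a low cost means the crop wraps
    around with few visible seams when tiled."""
    crop = tex[y:y + size, x:x + size]
    top_bot = np.sum((crop[0] - crop[-1]) ** 2)           # vertical wrap seam
    left_right = np.sum((crop[:, 0] - crop[:, -1]) ** 2)  # horizontal wrap seam
    return top_bot + left_right

def best_crop(tex, size, stride=8):
    """Exhaustively search crop positions and keep the one whose
    opposite borders match best."""
    h, w = tex.shape[:2]
    best, best_cost = (0, 0), float("inf")
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            c = boundary_gradient_cost(tex, size, y, x)
            if c < best_cost:
                best, best_cost = (y, x), c
    return best, best_cost
```

In the full pipeline this search would run jointly over all SVBRDF channels before the graph-cut seam optimization refines the tiling boundary.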
Step 2: Mapping SVBRDFs
Once our materials are tileable, we must use them to replace the SUNCG ones in a semantically meaningful way. Since the specular reflectance of SUNCG materials is not realistic, we do this only for diffuse textures and directly use the specularity from our dataset to render images. We manually divide the 633 most common diffuse textures from SUNCG, and those from our entire dataset, into 10 categories based on appearance and semantic labels, such as fabric, stone or wood. We render both sets of diffuse textures on a planar surface under flash lighting, use an encoder network similar to [37] to extract features, and then use nearest neighbors to map the materials. We randomly choose from the precomputed nearest neighbors when rendering images in our dataset.
Step 3: Mapping homogeneous BRDFs
To map homogeneous materials from SUNCG to ours, we keep the diffuse albedo unchanged and map the specular Phong parameters to our microfacet model. Since the two lobes are very different, a direct fitting does not work. Instead, we compute a distribution of microfacet parameters conditioned on Phong parameters, based on the mapping of diffuse textures, then randomly sample from that distribution to map the specular parameters. Specifically, let $p$ denote the specular parameters of the Phong model and $m$ those of our microfacet BRDF. For a SUNCG material with specular parameters $p$, we count the pixels in its 10 nearest neighbors from our dataset whose specular parameters are $m$, and sum these counts across the whole SUNCG dataset to obtain $N(m, p)$. The probability of a material having microfacet specular parameters $m$, given that the original SUNCG material has specular parameters $p$, is then defined as:

P(m \mid p) = \frac{N(m, p)}{\sum_{m'} N(m', p)}

We represent the distribution as a piecewise constant function and interpolate uniformly inside each bin to obtain continuous specular parameters for the microfacet BRDF.
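A minimal sketch of this conditional sampling is shown below, assuming the Phong and microfacet specular parameters have already been discretized into integer bins; the helper names and bin structure are our own illustration, and the uniform within-bin jitter used to make the sample continuous is omitted.

```python
import numpy as np
from collections import defaultdict

def build_conditional(pairs):
    """Count microfacet bins m observed among the nearest neighbors of
    materials with Phong bin p, then normalize to P(m | p)."""
    counts = defaultdict(lambda: defaultdict(int))
    for p, m in pairs:
        counts[p][m] += 1
    cond = {}
    for p, ms in counts.items():
        total = sum(ms.values())
        cond[p] = {m: n / total for m, n in ms.items()}
    return cond

def sample_microfacet(cond, p, rng):
    """Draw a microfacet specular bin for Phong bin p from P(m | p)."""
    items = sorted(cond[p].items())
    bins = [m for m, _ in items]
    probs = [q for _, q in items]
    return bins[rng.choice(len(bins), p=probs)]
```

Here `pairs` would be gathered from the nearest-neighbor matches across the whole dataset, one entry per counted pixel.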
Comparative results
Figure 5 shows a few scenes rendered with Lambertian, SUNCG Phong and our BRDF models. Images rendered with Lambertian BRDF do not have any specularity, those with Phong BRDF have strong but flat specular highlights, while ours are clearly more realistic. All the materials in our rendering are perfectly tiled and assigned to the correct objects, which demonstrates the effectiveness of our material mapping pipeline.
3.2 Spatially Varying Lighting
To enable tasks such as object insertion or material editing, we must estimate lighting at every spatial location, encoding complex global interactions. Thus, our dataset must also provide such ground truth. We do so by rendering, for every pixel, an environment map at the corresponding 3D point on the object surface.
In Figure 8, we show that an image obtained by integrating the product of this lighting and the BRDF over the hemisphere looks very realistic, with high frequency specular highlights correctly rendered. Note that global illumination and occlusion are already baked into the per-pixel lighting, which makes it possible for a model trained on our lighting dataset to reason about those complex effects.
Enhancing lighting variations
The SUNCG dataset is rendered with only one outdoor environment map and two fixed area light intensities (one for light bulbs and another for light shades). We add variations to the lighting to ensure generalizability, using HDR outdoor panoramas from [23] and random RGB intensities for the area lights.
3.3 Fast PhysicallyBased Rendering
To render high quality images with realistic appearance, it is necessary to use a physically-based renderer that models complex light transport effects such as global illumination and soft shadows. However, current open source CPU renderers are too slow for creating a large dataset, especially for rendering per-pixel lighting. Thus, we implement our own physically-based GPU renderer using NVIDIA OptiX [2]. To render a 480×640 image with 16,384 samples per pixel, our renderer needs 36 minutes on a Tesla V100 GPU, while Mitsuba [26] on 16 cores of an Intel i7-6900K CPU takes roughly an order of magnitude longer. Figure 6 compares images rendered with Mitsuba and with our renderer in the same amount of time.
Rendered dataset
We render HDR images, split into training and testing sets. We also render per-pixel ground-truth lighting for a subset of the images in the training set and for all images in the testing set. Our dataset and renderer will be made publicly available.
4 Network Design
Estimating material, geometry and lighting from a single indoor image is an extremely ill-posed problem, which we solve using priors learned by our physically-motivated deep network (architecture shown in Figure 7). Our network consists of cascaded stages of an SVBRDF and geometry predictor, a spatially-varying lighting predictor and a differentiable rendering layer, followed by a bilateral solver for refinement.
Material and geometry prediction
The input to our network is a single gamma-corrected low dynamic range image $I$, stacked with a predicted three-channel segmentation mask $\tilde{M}$ that separates pixels belonging to objects, area lights and the environment map, where $\tilde{(\cdot)}$ denotes predicted quantities. The mask is obtained through a pretrained network and is useful since some predictions are not defined everywhere (for example, the BRDF is not defined on light sources). Inspired by [37, 38], we use a single encoder to capture correlations between material and shape parameters, with four decoders for diffuse albedo ($A$), roughness ($R$), normal ($N$) and depth ($D$). Skip links are used to preserve details. The initial estimates of material and geometry are then given by

\{\tilde{A}, \tilde{N}, \tilde{R}, \tilde{D}\} = \mathrm{MGNet}(I, \tilde{M})    (1)
Spatially Varying Lighting Prediction
Inverse rendering for indoor scenes requires predicting spatially-varying lighting for every pixel in the image. Using an environment map as the lighting representation leads to a very high dimensional output space, which causes memory issues and unstable training due to small batch sizes. Spherical harmonics are a compact lighting representation used in recent works [28, 38], but they do not efficiently recover the high frequency lighting necessary to handle specular effects [48, 9]. Instead, we follow precomputed radiance transfer methods [57, 22, 59] and use isotropic spherical Gaussians, which approximate all-frequency lighting with a small number of parameters. We model the lighting as a spherical function approximated by a sum of spherical Gaussian lobes:
L(\mathbf{d}) = \sum_{k=1}^{K} F_k \exp\big(\lambda_k (\boldsymbol{\xi}_k \cdot \mathbf{d} - 1)\big)    (2)

where $\mathbf{d}$ and $\boldsymbol{\xi}_k$ are vectors on the unit sphere, $F_k \in \mathbb{R}^3$ controls the RGB color intensity and $\lambda_k$ controls the bandwidth. Each spherical Gaussian lobe is thus represented by 6 parameters $\{\boldsymbol{\xi}_k, \lambda_k, F_k\}$. Figure 8 compares the images rendered with a 12-lobe spherical Gaussian approximation (72 parameters) and a fourth-order spherical harmonics approximation (75 parameters). Quantitative comparisons of lighting approximation and rendering errors are in Table 1. It is evident that even with fewer parameters, the spherical Gaussian lighting performs better, especially close to specular regions.
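For concreteness, evaluating such a sum of isotropic spherical Gaussian lobes takes only a few lines of NumPy. In this sketch (notation ours) each lobe has a unit axis xi, a bandwidth lam, and an RGB amplitude F; each lobe peaks at its amplitude when the query direction coincides with its axis, and decays exponentially away from it.

```python
import numpy as np

def eval_sg(d, xi, lam, F):
    """Evaluate a sum of K isotropic spherical Gaussian lobes at unit
    directions d.  d: (N, 3); xi: (K, 3) unit lobe axes; lam: (K,)
    bandwidths; F: (K, 3) RGB amplitudes.  Returns (N, 3) radiance."""
    # G_k(d) = F_k * exp(lam_k * (d . xi_k - 1)); peak value F_k at d = xi_k
    cos = d @ xi.T                          # (N, K) cosines to each lobe axis
    w = np.exp(lam[None, :] * (cos - 1.0))  # (N, K) lobe weights
    return w @ F                            # (N, 3) summed RGB radiance
```

Evaluating the 12 lobes on a direction grid yields the per-pixel environment map used for supervision and for rendering.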
Our novel lighting prediction network accepts the predicted material and geometry as input, along with the image. It uses a shared encoder and separate decoders with a final tanh layer to predict the spherical Gaussian parameters:

\{\tilde{\boldsymbol{\xi}}, \tilde{\lambda}', \tilde{F}'\} = \mathrm{LightNet}(I, \tilde{A}, \tilde{N}, \tilde{R}, \tilde{D})    (3)
The tanh outputs lie in $(-1, 1)$; these low dynamic range parameters are mapped to high dynamic range parameters $\tilde{\boldsymbol{\xi}}$, $\tilde{\lambda}$ and $\tilde{F}$ through fixed nonlinear transformations. Thus, our final predicted lighting is HDR, which is important for applications like relighting and material editing.
                 Lighting   Image
SH (75 para.)    4.43
SG (72 para.)    1.56
Differentiable rendering layer
Our dataset in Section 3 provides ground truth for all scene components. But to model realistic indoor scene appearance, we additionally use a differentiable in-network rendering layer to mimic the image formation process, thereby combining those components in a physically meaningful way. We implement this layer by numerically integrating the product of the SVBRDF and the spatially-varying lighting over the hemisphere. Let $\{\mathbf{l}_i\}$ be a set of light directions sampled over the upper hemisphere, and $\mathbf{v}$ be the view direction. The rendering layer computes the diffuse image $\tilde{I}_d$ and the specular image $\tilde{I}_s$ as:

\tilde{I}_d = \frac{\tilde{A}}{\pi} \sum_i \tilde{L}(\mathbf{l}_i)\, (\tilde{N} \cdot \mathbf{l}_i)\, \delta\omega_i    (7)

\tilde{I}_s = \sum_i f_s(\mathbf{l}_i, \mathbf{v})\, \tilde{L}(\mathbf{l}_i)\, (\tilde{N} \cdot \mathbf{l}_i)\, \delta\omega_i    (8)

where $\delta\omega_i$ is the differential solid angle and $f_s$ is the microfacet specular BRDF. We sample a fixed set of lighting directions; while this resolution is relatively low, we empirically find, as shown in Figure 8, that it is sufficient to recover most high frequency lighting effects.
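A per-pixel sketch of this numerical hemisphere integration is below. For brevity it substitutes a simple Blinn-Phong-style specular lobe for the paper's microfacet model, so it illustrates the structure of (7) and (8), a discrete sum of lighting times BRDF times cosine times solid angle, rather than the exact BRDF; all names are illustrative.

```python
import numpy as np

def render_pixel(albedo, roughness, normal, view, dirs, radiance, dw):
    """Numerically integrate lighting x BRDF over the hemisphere for one
    pixel.  dirs: (M, 3) unit light directions; radiance: (M, 3) lighting
    evaluated along dirs (e.g. from the SG lobes); dw: (M,) solid angles."""
    ndl = np.clip(dirs @ normal, 0.0, None)                   # cosine terms
    diffuse = (albedo / np.pi) * (radiance * (ndl * dw)[:, None]).sum(0)
    # Blinn-Phong-style lobe standing in for the microfacet f_s
    half = dirs + view
    half /= np.linalg.norm(half, axis=1, keepdims=True)
    shininess = 2.0 / max(roughness ** 2, 1e-4)
    f_s = (half @ normal).clip(0.0, None) ** shininess
    specular = (radiance * (f_s * ndl * dw)[:, None]).sum(0)
    return diffuse, specular
```

Because every operation is a differentiable array expression, the same computation written in an autodiff framework backpropagates the rendering error to the BRDF, geometry and lighting predictions.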
Loss Functions
Our loss functions incorporate physical insights. We first observe that two ambiguities are difficult to resolve: the ambiguity between surface color and light intensity, and the scale ambiguity of single image depth estimation. Thus, we make the related loss functions scale-invariant. For material and geometry, we use a scale-invariant $L_2$ loss for diffuse albedo ($A$), $L_2$ losses for normal ($N$) and roughness ($R$), and a scale-invariant log-encoded $L_2$ loss for depth ($D$) due to its high dynamic range:

\mathcal{L}_D = \| \log(c\,\tilde{D} + 1) - \log(D + 1) \|_2^2    (9)

where $c$ is a scale factor computed by least squares regression. For lighting estimation, we find that supervising both the environment maps and the spherical Gaussian parameters is important for preserving high frequency details. Thus, we compute ground-truth spherical Gaussian lobe parameters by approximating the ground-truth lighting using the L-BFGS method (please refer to Appendix F for details). We use the same scale-invariant encoded loss as (9) for the weights ($F$), bandwidths ($\lambda$) and lighting ($L$), with an $L_2$ loss for the directions $\boldsymbol{\xi}$. We also add a scale-invariant rendering loss:

\mathcal{L}_{ren} = \| c_d\, \tilde{I}_d + c_s\, \tilde{I}_s - I \|_2^2    (10)

where $\tilde{I}_d$ and $\tilde{I}_s$ are rendered using (7) and (8), respectively, while $c_d$ and $c_s$ are positive scale factors computed using least squares regression. The final loss function is a weighted sum of the proposed losses:

\mathcal{L} = \sum_x w_x\, \mathcal{L}_x    (11)
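The least-squares scale used by the scale-invariant losses has a closed form, since minimizing over a single multiplicative factor is a one-variable quadratic. The sketch below (a plain $L_2$ penalty after scale alignment; notation ours) shows the idea:

```python
import numpy as np

def scale_invariant_l2(pred, gt):
    """L2 loss after solving the closed-form least-squares scale
    c = <pred, gt> / <pred, pred>, which best aligns the prediction
    with the ground truth and resolves the global intensity ambiguity."""
    c = (pred * gt).sum() / max((pred * pred).sum(), 1e-12)
    return c, ((c * pred - gt) ** 2).mean()
```

A prediction that is correct up to a global scale therefore incurs zero loss, which is exactly the invariance needed for the color/intensity and depth-scale ambiguities.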
Refinement using bilateral solver
We use an end-to-end trainable bilateral solver to impose a smoothness prior [7, 36]. The inputs to the bilateral solver are the prediction to be refined, the estimated diffuse albedo as a guidance image, and a confidence map $C_X$. We train a shallow network with three sixteen-channel layers for confidence map prediction. Let $\mathrm{BS}$ be the bilateral solver and $\mathrm{NC}$ be the confidence prediction network, where $X \in \{A, R, D\}$; we do not find refinement to have much effect on normals. The refinement process is:

C_X = \mathrm{NC}(\tilde{X}, \tilde{A})    (12)

\bar{X} = \mathrm{BS}(\tilde{X}, \tilde{A}, C_X)    (13)

where we use $\bar{X}$ to denote predictions after refinement.
Cascade Network
Akin to recent works on high resolution image synthesis [31, 16] and inverse rendering [38], we introduce a cascaded network that progressively increases resolution and iteratively refines the predictions through global reasoning. We achieve this by sending both the predictions and the output of the rendering layer applied to those predictions to the next cascade stage, so that the network can reason about their differences:

\{\tilde{A}_{i+1}, \tilde{N}_{i+1}, \tilde{R}_{i+1}, \tilde{D}_{i+1}\} = \mathrm{MGNet}_{i+1}(I, \tilde{A}_i, \tilde{N}_i, \tilde{R}_i, \tilde{D}_i, \tilde{I}_i)    (14)

\{\tilde{\boldsymbol{\xi}}_{i+1}, \tilde{\lambda}'_{i+1}, \tilde{F}'_{i+1}\} = \mathrm{LightNet}_{i+1}(I, \tilde{A}_i, \tilde{N}_i, \tilde{R}_i, \tilde{D}_i, \tilde{\boldsymbol{\xi}}_i, \tilde{\lambda}'_i, \tilde{F}'_i, \tilde{I}_i)    (15)

where $\tilde{I}_i$ denotes the image rendered from the stage-$i$ predictions. Cascade stages have architectures similar to their initial network counterparts. Note that we send the low dynamic range lighting predictions $\tilde{\lambda}'_i$, $\tilde{F}'_i$ instead of the high dynamic range ones, because we observe that this makes training more stable.
Training Details
It is hard to train our whole pipeline end-to-end from scratch due to limited GPU memory, even with group normalization [58]. Therefore, we first train the material-geometry and lighting networks separately with large batch sizes, then fine-tune them together with smaller batch sizes, and finally train the bilateral solver. Please refer to Appendix G
for training details and hyperparameter choices.
5 Experiments
Our experiments highlight the effectiveness of our dataset and network for single image inverse rendering in indoor scenes, through shape, material and lighting estimation. We achieve high accuracy on synthetic data and competitive performance on real images with respect to methods that focus only on a subset of those tasks. We conduct studies on the roles of various components in our pipeline. Finally, we illustrate applications such as high quality object insertion and material editing in real images that can only be enabled by our holistic solution to inverse rendering.
5.1 Analysis of Network and Training Choices
We study the effect of the cascade structure, joint training and refinement. Quantitative results for material and geometry predictions on the proposed dataset are summarized in Table 2, while those for lighting are shown in Table 3.
Cascade
The cascade structure leads to clear gains for shape, BRDF and lighting estimation by iteratively improving and upsampling our predictions, as shown in Tables 2 and 3. This holds for both real and synthetic data, as shown in Figures 9 and 10. We observe that the cascade structure effectively removes noise while preserving high frequency details for both materials and lighting. The errors in our shape, material and lighting estimates are low enough to photorealistically edit the scene to insert new objects while preserving global illumination effects. In Figure 10, we observe that the image rendered using our predicted material, shape and lighting closely matches the input image.
Joint training for inverse rendering
Next, we study whether BRDF, shape and lighting predictions can help improve each other. We compare jointly training the whole pipeline ("Joint") using the loss in (11) against independently training each component ("Ind."). Quantitative errors in Tables 2 and 3 show that while errors for shape and BRDF prediction remain similar, those for rendering and lighting decrease. Next, we test lighting prediction without the predicted BRDF as input to the first cascade stage ("No MG"). Both the quantitative results in Table 3 and the qualitative comparison in Figure 11 demonstrate that the predicted BRDF and shape are important for the network to recover spatially-varying lighting. Without the predicted material and geometry as input, the predicted lighting (especially the ambient color) does not sufficiently adapt spatially to the scene, possibly because of ambiguities between lighting and surface reflectance. This justifies our choice of jointly reasoning about shape, material and lighting. We also test lighting prediction with and without ground-truth SVSG parameters as supervision ("No SG"), finding that direct supervision leads to sharper lighting predictions, as shown in Figure 11.
Refinement
Finally, we study the impact of the bilateral solver. Quantitative improvements over the second cascade stage in Table 2 are modest, which indicates that the network already learns good smoothness priors by that stage; this is visible in Figure 10, where the second cascade stage generates smooth predictions for both material and lighting. However, we find the qualitative impact of the bilateral solver to be noticeable on real images (for example, the diffuse albedo in Figure 9), so we use it in all our real-image experiments.
Qualitative examples
In Figure 12, we use a single input image from our synthetic test set to demonstrate depth, normal, SVBRDF and spatiallyvarying lighting estimation. The effectiveness is illustrated by low errors with respect to ground truth. Accurate shading and global illumination effects on an inserted object, as well as photorealistic editing of scene materials, show the utility of our decomposition.
          Cascade 0        Cascade 1
          Ind.    Joint    Ind.    Joint   BS
          1.28    1.28     1.18    1.18    1.16
          4.91    4.91     4.91    4.51    4.51
          1.72    1.72     1.72    1.72    1.70
          8.06    8.00     7.29    7.26    7.20
          Cascade 0                        Cascade 1
          No MG   No SG   Ind.    Joint    Ind.    Joint
          2.83    2.85    2.54    2.50     2.49    2.43
          5.00    1.56    1.56    1.06     1.92    1.11
5.2 Comparisons with Previous Works
We address the problem of holistic inverse rendering with spatiallyvarying material and lighting which has not been tackled earlier. Yet, it is instructive to compare our approach to prior ones that focus on specific subproblems.
Method              Training Set   WHDR
Ours (cascade 0)    Ours           23.29
Ours (cascade 1)    Ours           21.99
Ours (cascade 0)    Ours + IIW     16.83
Ours (cascade 1)    Ours + IIW     15.93
Li et al. [36]      CGI + IIW      17.5
Intrinsic decomposition
We compare two versions of our method on the IIW dataset [10] for intrinsic decomposition evaluation: our network trained on our data alone, and our network fine-tuned on the IIW dataset. The results are tabulated in Table 4. We observe that the cascade structure is beneficial. We also observe a lower error compared to the prior work of [36], which indicates the benefit of our dataset, rendered with higher photorealism, as well as of a network design that closely reflects physical image formation.
Lighting estimation
We first compare to the method of Barron et al. [6] on our test set; our scale-invariant shading errors on the {R, G, B} channels are substantially lower than theirs. Our shape, material and spatially-varying lighting estimation, together with a physically-motivated network trained on a realistic large-scale dataset, lead to this large improvement. Qualitative comparisons are shown in Figure 13, where we render specular spheres into the image using the different lighting predictions. While our method clearly captures the complex lighting variations and high frequency components, the spheres rendered with the lighting predicted by [6] are diffuse and have similar intensity across different regions of the image. Since [6] predicts only spherical harmonics parameters for log shading, there is no physically correct way to turn its estimated spherical harmonics into environment lighting; therefore, it cannot handle shadows and interreflections between the object and the scene. Further, since [6] predicts only two orders of spherical harmonics parameters, it cannot handle high frequency lighting.
Next, we also compare with the work of Gardner et al. [20], which predicts a single environment lighting for the whole indoor scene. Quantitative results on our test set show that their mean error across the whole image is 3.34 while our error is 2.43. Qualitative results are shown in Figure 13. Since only one lighting for the whole scene is predicted by [20], no spatiallyvarying lighting effects can be observed.
In Figure 14, we compare our method with [5] and [20] on several real examples for object insertion in an image with spatiallyvarying illumination. It is clear that our method achieves a significant improvement in object insertion.
Method              Mean (°)   Median (°)   Depth (Inv.)
Ours (cascade 0)    27.08      21.14        0.217
Ours (cascade 1)    26.33      20.21        0.206
Depth and normal estimation
We fine-tune our network, trained on our synthetic dataset, on the NYU dataset as discussed in Appendix G. The test errors on the NYU dataset are summarized in Table 5. When evaluating depth, we do not consider ground-truth depth values smaller than 1 or larger than 10, since they are outside the valid range of the sensor. When evaluating normals, we mask out regions without accurate ground-truth normals. For both depth and normal prediction, the cascade structure consistently improves performance. Zhang et al. [60] achieve state-of-the-art performance for normal estimation using a more complex fine-tuning strategy that selects images with appearance similar to the NYU dataset, and with more than six times as much training data. Eigen et al. [19] achieve better results by using 120K frames of raw video data to train their network, while we pretrain on synthetic images with a larger domain gap and use only 795 NYU images for fine-tuning. Although we do not achieve competitive performance on this task, it is not our main focus; rather, we illustrate the wide utility of our proposed dataset and demonstrate estimation of the factors of image formation that is good enough to support photorealistic augmented reality applications.
5.3 Novel Applications
Learning a disentangled shape, SVBRDF and spatiallyvarying lighting representation allows new applications that were hitherto not possible. We consider two of them here, object insertion and material editing. Before we discuss the two applications, we first describe how we resolve the ambiguity between scales of lighting and diffuse albedo.
Scales of lighting and diffuse albedo
We use a scale-invariant loss for both diffuse albedo and lighting prediction. However, for real applications, we need to recover the scales of both diffuse albedo and lighting. Let $c_d$ and $c_s$ be the scale coefficients of the diffuse albedo and the lighting, respectively. Recall that our rendering layer outputs a diffuse image $I_d$ and a specular image $I_s$. We can compute coefficients $c_d$ and $c_s$ to minimize the error between $c_d I_d + c_s I_s$ and the input image $I$. Since our specular albedo is a constant, the scaling factor for our lighting prediction is $c_s$, and we have

(16) $(c_d, c_s) = \arg\min_{c_d, c_s} \| c_d I_d + c_s I_s - I \|^2,$

(17) $\begin{pmatrix} c_d \\ c_s \end{pmatrix} = \begin{pmatrix} \langle I_d, I_d \rangle & \langle I_d, I_s \rangle \\ \langle I_s, I_d \rangle & \langle I_s, I_s \rangle \end{pmatrix}^{-1} \begin{pmatrix} \langle I_d, I \rangle \\ \langle I_s, I \rangle \end{pmatrix},$

where $\langle \cdot, \cdot \rangle$ denotes the sum of pixelwise products over the image. However, for some images, specularity may be hard to observe, in which case we neglect the specular term and simply compute the coefficients using

(18) $c_d = 1 / \max(\tilde{A}),$

(19) $c_s = \frac{1}{c_d} \frac{\langle I_d, I \rangle}{\langle I_d, I_d \rangle},$

where $\tilde{A}$ is the predicted diffuse albedo. That is, we set the scale of the diffuse albedo so that the largest albedo in the image is 1 and compute the coefficient of the lighting accordingly. To decide which strategy to use, we compute the determinant of the normal-equation matrix when we regress $c_d$ and $c_s$:

(20) $D = \frac{1}{N^2} \det \begin{pmatrix} \langle I_d, I_d \rangle & \langle I_d, I_s \rangle \\ \langle I_s, I_d \rangle & \langle I_s, I_s \rangle \end{pmatrix},$

where $N$ is the number of pixels in the image. If $D$ is above a small threshold, we use (16) and (17); otherwise we use (18) and (19) to compute the coefficients.
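The least-squares scale recovery described above (Eqs. 16-20) can be sketched as follows, writing `cd` and `cs` for the albedo and lighting coefficients. The threshold `eps` and the unit-albedo fallback (standing in for normalizing by the largest predicted albedo, which is not passed in here) are illustrative assumptions, not values from our implementation.

```python
import numpy as np

def recover_scales(I, Id, Is, eps=1e-4):
    """Recover scale coefficients for diffuse albedo (cd) and lighting (cs)
    by least-squares fitting cd*Id + cs*Is ~ I. Falls back to a
    diffuse-only solution when the specular image is too weak for the
    2x2 normal-equation system to be well conditioned."""
    id_, is_, i_ = Id.ravel(), Is.ravel(), I.ravel()
    N = id_.size
    # 2x2 normal-equation matrix and right-hand side.
    A = np.array([[id_ @ id_, id_ @ is_],
                  [is_ @ id_, is_ @ is_]])
    b = np.array([id_ @ i_, is_ @ i_])
    D = np.linalg.det(A) / N**2          # normalized determinant, Eq. (20)
    if D > eps:                          # specularity is observable
        cd, cs = np.linalg.solve(A, b)   # joint solution, Eqs. (16)-(17)
    else:                                # neglect the specular term
        cd = 1.0                         # stand-in for 1/max(predicted albedo)
        cs = (id_ @ i_) / (id_ @ id_) / cd
    return cd, cs
```

When the two rendered components are nearly collinear (e.g. a matte scene where the specular image is almost zero), the determinant test routes the computation to the stable diffuse-only branch.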
Object insertion
To render a new object into the scene, we first crop a planar region and pick a point on that plane at which to place the new object. The orientation, diffuse albedo and roughness of the plane are all obtained from our predictions. We then render the plane and the object together using the lighting predicted at the point where we place the object; rendering them together ensures that interreflections between them are properly simulated. We compute a high-resolution environment map from the estimated spherical Gaussian parameters so that even very glossy materials can be handled correctly.
We render two images, $I_{op}$ and $I_p$, and two binary masks, $M_o$ and $M_{op}$. $I_{op}$ is the rendered image of the plane and object together, and $I_p$ is the rendered image of the plane only. $M_o$ is the mask of the object, and $M_{op}$ is the mask covering both the cropped plane and the object. We then edit the object region and the cropped-plane region separately. Let $I$ be the original image and $\hat{I}$ be the new image with the rendered object. For the object region, we directly use the intensities as rendered in $I_{op}$ on the virtual object:

(21) $\hat{I}[p] = I_{op}[p], \quad p \in M_o.$

For the remaining region on the plane, we blend the original image intensities with the ratio of $I_{op}$ and $I_p$:

(22) $\hat{I}[p] = I[p]\, \frac{I_{op}[p]}{I_p[p]}, \quad p \in M_{op} \setminus M_o.$

All operations in the above relations are pixelwise. This compositing procedure uses the idea of ratio (or quotient) images that has been used in the past for relighting [42, 50]. It ensures that global effects due to object-plane interaction, such as soft shadows, are visualized (since they are rendered in $I_{op}$ but absent in $I_p$), while keeping intensities consistent with the overall image. This suppresses high-frequency artifacts in $\hat{I}$ that might be caused by minor errors in the estimated albedo, roughness and lighting, thereby achieving greater photorealism.
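The quotient-image compositing of Eqs. (21)-(22) reduces to a few pixelwise operations. A minimal sketch, where the image and mask names are ours for illustration:

```python
import numpy as np

def composite(I, I_op, I_p, M_o, M_op, eps=1e-6):
    """Insert a rendered object using the quotient-image idea:
    copy rendered intensities inside the object mask M_o (Eq. 21), and
    modulate the original image by the ratio I_op / I_p on the rest of
    the cropped plane (Eq. 22), so soft shadows and interreflections
    carry over. M_o and M_op are boolean masks (object, object+plane)."""
    out = I.copy()
    ratio = I_op / np.maximum(I_p, eps)      # pixelwise ratio image
    plane_only = M_op & ~M_o                 # plane region outside the object
    out[plane_only] = I[plane_only] * ratio[plane_only]
    out[M_o] = I_op[M_o]                     # object region: rendered pixels
    return out
```

The `eps` guard against division by zero is an assumption for numerical safety; the same code works for grayscale or per-channel RGB arrays.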
Figures 14, 2 and 1 show several examples of object insertion in real images. In all these examples, we render white glossy objects with constant diffuse albedo and a low roughness value. We use a white color so that the color of the lighting and global illumination effects can be clearly observed, and we keep the shape simple and the roughness low to better demonstrate the high-frequency components of the predicted lighting. To further demonstrate our performance, a video of an object moving around the scene, rendered using our predictions, is included in the supplementary material.
Material Editing
Editing the material properties of objects in a scene from a single photograph has applications in interior design and visualization. Our disentangled shape, material and lighting estimation allows rendering new appearances by replacing the material and rendering under the estimated lighting. In Figures 3 and 15, we replace the material of a planar region with a different material and render the image using the predicted geometry and spatially-varying lighting; the spatially-varying properties of the predicted lighting can be clearly observed. In the first example in Figure 3, the specular highlight in the original image is preserved after changing the material. Such specular highlights cannot be modeled by traditional intrinsic decomposition methods, since the direction of the incoming lighting is unknown. In Figure 16, we add a specular highlight to the selected object by changing the roughness value to 0.2 and rendering the object with the predicted diffuse albedo, geometry and spatially-varying lighting. We compute the residual image before and after changing the roughness value and add it back to the original image. Even though the difference is quite subtle, we observe that the distribution of the specular highlight looks plausible.
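The residual trick for the roughness edit above can be sketched as follows. Here `render` is a hypothetical callable standing in for rendering the selected object under the predicted diffuse albedo, geometry and spatially-varying lighting at a given roughness.

```python
import numpy as np

def edit_roughness(I, render, mask, r_old, r_new=0.2):
    """Add a synthetic specular highlight by re-rendering the selected
    object at a new roughness and compositing the residual back into
    the photograph. `render(roughness)` returns the object rendered
    with the predicted scene factors; `mask` selects the object."""
    residual = render(r_new) - render(r_old)   # change in shading only
    out = I.copy()
    out[mask] += residual[mask]                # add the highlight residual
    return out
```

Because only the residual is added back, low-frequency estimation errors shared by both renderings cancel, which is what keeps the edit subtle and plausible.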
6 Conclusion
We have presented the first holistic inverse rendering framework that estimates disentangled shape, SVBRDF and spatially-varying lighting from a single image of an indoor scene. Insights from computer vision, graphics and deep convolutional networks are combined to solve this challenging ill-posed problem. A GPU-accelerated renderer is used to synthesize a large-scale, realistic dataset with complex materials and global illumination. Our per-pixel SVSG lighting representation captures high-frequency effects. Our network design incorporates components such as a differentiable rendering layer that are crucial for generalization to real images, while design choices such as the cascade structure and a bilateral solver lead to further benefits. Despite solving the joint problem, we obtain competitive results with respect to prior works that focus on constituent subproblems, which highlights the impact of our dataset, representation choices and network design. We demonstrate object insertion and material editing applications on real images that capture global illumination effects, motivating applications in augmented reality and interior design.
Appendix A Appendix Outline
We have presented a method to automatically disentangle a single image of an indoor scene into its constituent physical scene factors – geometry, spatiallyvarying reflectance, and illumination. In these appendices, we present more results, analyses and details. This includes: more challenging cases for our model (Appendix B and Appendix C), details about our SVBRDF model (Appendix D), dataset creation (Appendix E), our lighting model (Appendix F) and our network architecture and training details (Appendix G).
Appendix B Generalization to Outdoor Scenes
In this section, we test how well our model, trained only on synthetic indoor scenes, generalizes to outdoor scenes. Qualitative results are shown in Figure 17. While the network interprets the outdoor scene as a room surrounded by walls, we observe that the overall estimates of geometry, lighting and diffuse albedo look reasonable. We also insert a new object into the scene using our predictions, following the pipeline proposed in Section 5, and compare with a state-of-the-art outdoor lighting estimation method [24]. As shown in the last two columns of Figure 17, the method of [24] better preserves high frequencies in the outdoor illumination, which results in shadows with hard boundaries, while our method tends to predict lower-frequency lighting. This is probably due to the domain gap between training and test images. However, we notice that our model can usually predict the direction of the incoming light correctly, and the spatial variation in our lighting prediction looks much more realistic compared to [24], which predicts a single lighting model for the whole image.
Appendix C A Failure Case
While we observe largely successful object insertions in most experiments, some failure cases do occur. The ambiguity between albedo and lighting is a hard one to disentangle. In some cases, the albedo is estimated to be too bright (dark), with the lighting correspondingly estimated as too dark (bright). An example is shown in Figure 18. Regardless, we emphasize that being able to estimate spatiallyvarying lighting along with SVBRDF and shape is an extremely hard problem, for which our network succeeds in an overwhelming number of experiments.
Appendix D BRDF Model and Material Categories
Table 6: Number of material samples per category in our dataset.
wall paint: 127 | stone wall: 185 | leather: 10 | stone floor: 172 | plastic: 94
stone specular: 25 | ground: 243 | fabric: 180 | wood floor: 25 | wood: 42
Our microfacet model
We use a physically motivated microfacet BRDF model in our dataset. Let $A$, $N$ and $R$ be the spatially-varying diffuse albedo, normal and roughness, respectively. The BRDF model is:

(23) $f(\mathbf{v}, \mathbf{l}) = f_d + f_s,$

(24) $f_d = \frac{A}{\pi},$

(25) $f_s = \frac{D(\mathbf{h}, R)\, F(\mathbf{v}, \mathbf{h})\, G(\mathbf{l}, \mathbf{v}, R)}{4\, (N \cdot \mathbf{l})(N \cdot \mathbf{v})},$

where $f_d$ and $f_s$ are the diffuse and specular BRDF components. Here, $\mathbf{v}$ and $\mathbf{l}$ are the view and lighting directions and $\mathbf{h}$ is the half-angle vector, while $D$, $F$ and $G$ are the distribution, Fresnel and geometric terms, respectively, defined following [30]. We set the Fresnel base reflectance as suggested in [30].
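For concreteness, a single-point evaluation of such a model might look as follows. The GGX distribution, Schlick Fresnel and Smith geometry terms below are the standard choices popularized by [30]; the constant `F0` and the `k = alpha/2` geometry parameterization are assumptions for this sketch, not necessarily the exact constants of our implementation.

```python
import numpy as np

def microfacet_brdf(albedo, n, v, l, rough, F0=0.05):
    """Evaluate the diffuse + specular microfacet model of Eqs. (23)-(25)
    for unit vectors n (normal), v (view) and l (light)."""
    h = v + l
    h = h / np.linalg.norm(h)                     # half-angle vector
    ndl, ndv, ndh = n @ l, n @ v, n @ h
    vdh = v @ h
    alpha = rough ** 2
    # GGX normal distribution term
    D = alpha**2 / (np.pi * ((ndh**2 * (alpha**2 - 1) + 1) ** 2))
    # Schlick Fresnel approximation
    F = F0 + (1 - F0) * (1 - vdh) ** 5
    # Smith geometry term (Schlick-GGX form), assuming k = alpha / 2
    k = alpha / 2
    G = (ndv / (ndv * (1 - k) + k)) * (ndl / (ndl * (1 - k) + k))
    f_d = albedo / np.pi                          # Eq. (24)
    f_s = D * F * G / (4 * ndl * ndv)             # Eq. (25)
    return f_d + f_s                              # Eq. (23)
```

At normal incidence (all vectors aligned) the specular term reduces to $F_0 / (4\pi\alpha^2)$, a handy sanity check.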
Material categories
For mapping our materials onto the SUNCG geometry in a manner consistent with scene semantics such as the objects in the scene, we manually classified our dataset as well as the SUNCG materials into 10 categories. Samples from each dataset for these categories are shown in Figure 19. The number of material samples for each category in our dataset is shown in Table 6.
Appendix E Tileable Texture Synthesis
We use a graph-cut based approach to generate tileable textures, which has the advantage of preserving the original texture structures [34]. The overall process is summarized in Figure 20. We first crop a smaller patch from the original SVBRDF textures and then synthesize the cropped patch into a tileable texture. Given the required patch size, we first globally search for the optimal patch by minimizing the gradients perpendicular to the patch boundary. More specifically, let $\mathcal{B}_h$ and $\mathcal{B}_v$ be the sets of pixels on the horizontal and vertical boundaries of the patch, and let $A_p$, $N_p$ and $R_p$ be the diffuse color, normal and roughness at pixel $p$, respectively. We search for the patch that minimizes

(26) $E = \sum_{p \in \mathcal{B}_h} \big( \|\nabla_y A_p\| + \|\nabla_y N_p\| + \|\nabla_y R_p\| \big) + \sum_{p \in \mathcal{B}_v} \big( \|\nabla_x A_p\| + \|\nabla_x N_p\| + \|\nabla_x R_p\| \big).$

Equation (26) can be evaluated efficiently using integral images. The overall complexity of finding the optimal patch is $O(N)$, where $N$ is the number of pixels in the image. By minimizing (26), we avoid strong gradients near the patch boundaries, which reduces artifacts in the tileable texture synthesis.
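The constant-time-per-candidate search can be sketched with prefix sums standing in for the integral images; the variable names are ours. Each candidate patch position is scored by summing precomputed gradient magnitudes along its four boundary segments, each in O(1), so the full scan over all positions is O(N).

```python
import numpy as np

def boundary_energy_search(gx, gy, ph, pw):
    """Find the patch position whose boundary-perpendicular gradients are
    smallest (in the spirit of Eq. 26). gx, gy are per-pixel gradient
    magnitudes along x and y (e.g. summed over the albedo, normal and
    roughness maps); (ph, pw) is the patch size."""
    H, W = gx.shape
    # Prefix sums: horizontal boundaries need vertical (y) gradients,
    # vertical boundaries need horizontal (x) gradients.
    row_cum = np.concatenate([np.zeros((H, 1)), np.cumsum(gy, axis=1)], axis=1)
    col_cum = np.concatenate([np.zeros((1, W)), np.cumsum(gx, axis=0)], axis=0)
    best, best_pos = np.inf, None
    for y in range(H - ph + 1):
        for x in range(W - pw + 1):
            top = row_cum[y, x + pw] - row_cum[y, x]              # top edge
            bot = row_cum[y + ph - 1, x + pw] - row_cum[y + ph - 1, x]
            lef = col_cum[y + ph, x] - col_cum[y, x]              # left edge
            rig = col_cum[y + ph, x + pw - 1] - col_cum[y, x + pw - 1]
            e = top + bot + lef + rig
            if e < best:
                best, best_pos = e, (y, x)
    return best_pos, best
```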
Once we find the patch, we crop not only the patch but also its surrounding regions. To make the patch tileable in the x direction, we overlap the right and left surrounding regions and use the graph-cut method to find the best seam separating the overlapping regions by minimizing a customized energy function. Unlike the energy function in [34], which encourages the values of pixels at the seam to be similar to the values in the original textures, our energy function encourages the gradients of pixels at the seam to be similar to the gradients in the original textures. As in [34], we formulate the problem as a labeling problem. Let $p$ be a pixel in the overlapping region and $l_p$ be its label, indicating which of the two overlapping patches it is copied from. With some abuse of notation, we define $V(p, l)$ to be the value of pixel $p$ taken from patch $l$, and the gradient across adjacent pixels $p$ and $q$ in patch $l$ as $G(p, q, l) = V(q, l) - V(p, l)$. Then, for adjacent pixels $p$ and $q$ assigned different labels, the seam cost for a single texture map is defined as the difference between the gradients of the two patches:

$E(p, q) = \| G(p, q, l_p) - G(p, q, l_q) \|.$

The final loss is a weighted combination of the losses from the different texture maps:

(27) $E = \lambda_A E_A + \lambda_N E_N + \lambda_R E_R.$
To make the texture tileable in the y direction, we repeat the above process by overlapping the up and down surrounding regions and again finding the separating seam using graph cut. Notice that when separating the overlapping up and down regions, we need to make sure that the pixels at the right and left boundaries of the patch come from the same region, so that the patch remains tileable in the x direction. We achieve this by adding an infinite smoothness term between every pair of pixels at the left and right boundaries in the same row, so that they always come from the same region. Figure 21 shows some texture synthesis examples; each example is generated by tiling the synthesized patches together. For each material in our dataset, we crop three patches of different sizes, and the three patches are treated as different materials in the subsequent SVBRDF mapping stage.
Appendix F Ground Truth Spherical Gaussian Lobes
We compute ground-truth spherical Gaussian lobe parameters by approximating the environmental lighting using the L-BFGS method. These parameters are used to supervise the spatially-varying lighting prediction. We use 12 lobes to approximate the per-pixel lighting. To facilitate training, we assign an order to the 12 lobes by constraining each lobe to lie in a certain region of the hemisphere, which we roughly divide into 12 regions. Following the notation in Section 4, let $\boldsymbol{\xi}_k$, $\lambda_k$ and $F_k$ be the axis, sharpness and amplitude of the $k$-th spherical Gaussian lobe, so that the lighting is

(28) $L(\mathbf{l}) = \sum_{k=1}^{12} F_k\, e^{\lambda_k (\boldsymbol{\xi}_k \cdot \mathbf{l} - 1)}.$

In order to add the constraints, we reparameterize the spherical Gaussian parameters with unconstrained variables $\hat{\theta}_k$, $\hat{\phi}_k$, $\hat{\lambda}_k$ and $\hat{F}_k$ such that

(29) $\theta_k = \bar{\theta}_k + s_\theta \tanh(\hat{\theta}_k),$

(30) $\phi_k = \bar{\phi}_k + s_\phi \tanh(\hat{\phi}_k),$

(31) $\lambda_k = e^{\hat{\lambda}_k},$

(32) $F_k = e^{\hat{F}_k},$

where $(\theta_k, \phi_k)$ are the spherical coordinates of the axis $\boldsymbol{\xi}_k$, and $s_\theta$ and $s_\phi$ are scaling factors that bound each lobe within its region. Here, $\bar{\theta}_k$ and $\bar{\phi}_k$ are offset parameters locating the center of the $k$-th region, computed as

(33) $\bar{\theta}_k = (i_k + \tfrac{1}{2}) \Delta\theta,$

(34) $\bar{\phi}_k = (j_k + \tfrac{1}{2}) \Delta\phi,$

where $(i_k, j_k)$ indexes the region assigned to lobe $k$ and $\Delta\theta$, $\Delta\phi$ are the angular extents of each region. We initialize the unconstrained parameters so that each lobe starts at the center of its region. The loss function is the log-encoded loss described in (9).
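Evaluating the spherical Gaussian mixture of Eq. (28) over a set of query directions reduces to a pair of matrix products; a minimal sketch, with array names of our choosing:

```python
import numpy as np

def eval_sg_lighting(xi, lam, F, dirs):
    """Evaluate L(l) = sum_k F_k * exp(lam_k * (xi_k . l - 1)) (Eq. 28).
    xi: (K, 3) unit lobe axes, lam: (K,) sharpness, F: (K, 3) RGB
    amplitudes, dirs: (M, 3) unit query directions. Returns (M, 3)."""
    cos = dirs @ xi.T                        # (M, K) dot products xi_k . l
    w = np.exp(lam[None, :] * (cos - 1.0))   # (M, K) per-lobe weights
    return w @ F                             # (M, 3) summed radiance
```

Since each lobe attains its maximum $F_k$ exactly along its axis and decays with $e^{\lambda_k(\cos - 1)}$, larger $\lambda_k$ gives a sharper, more directional light.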
Figure 22 compares using spherical Gaussians and spherical harmonics to approximate the spatially-varying lighting, corresponding to Figure 8 and Table 1 in Section 4. It is clearly observed that, with a similar number of parameters, spherical Gaussians better recover high-frequency effects, resulting in reconstructed spatially-varying lighting that is closer to the ground truth.
Appendix G Network Structures and Training Details
[Table 7: loss coefficients, training epochs/iterations and batch sizes for each level and subnetwork of the cascade.]
[Table 8: loss coefficients, training iterations and batch size for the final training stage.]
The network structures are summarized in Figure 23. Note that we use group normalization [58] instead of batch normalization so that we can train the network with a smaller batch size. The padding size is assigned dynamically according to the feature map size so that the feature maps after upsampling align with the feature maps coming from the skip links. Therefore, our network can process images of arbitrary size without scaling or cropping. The network for spatially-varying lighting prediction has more parameters because we find this necessary to achieve reasonable performance on that task.
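The dynamic padding can be sketched as computing, per input, the extra rows and columns needed so that upsampled feature maps align exactly with their skip connections. The assumption that each spatial dimension must be divisible by $2^{\text{levels}}$ (one factor of 2 per stride-2 downsampling) is ours for this illustration.

```python
def pad_for_alignment(h, w, levels):
    """Compute (pad_h, pad_w) so that an h x w input, after `levels`
    stride-2 downsamplings, upsamples back to the padded size exactly,
    letting skip connections concatenate without cropping."""
    m = 2 ** levels                 # total downsampling factor
    pad_h = (m - h % m) % m         # extra rows to reach divisibility
    pad_w = (m - w % m) % m         # extra columns to reach divisibility
    return pad_h, pad_w
```

For example, a 5-level encoder needs dimensions divisible by 32, so a 240 x 320 input is padded by 16 rows and 0 columns.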
We use the Adam optimizer to train our network. Each level of the cascade is trained separately. To train the cascade at level $i$, we first train the two subnetworks, one predicting the BRDF and geometry and one predicting the lighting, separately, and then finetune them together. The loss function for the BRDF and geometry network is

(35) $\mathcal{L} = \lambda_A \mathcal{L}_A + \lambda_N \mathcal{L}_N + \lambda_D \mathcal{L}_D + \lambda_R \mathcal{L}_R,$

with the various terms as defined in Sec. 4. We add the rendering loss when training the lighting network, whose loss function is

(36) $\mathcal{L} = \lambda_L \mathcal{L}_L + \lambda_{ren} \mathcal{L}_{ren}.$

The loss function for finetuning the whole pipeline is defined in Eq. (11). Finally, the loss function for the remaining refinement stage is a weighted combination of its component losses, with coefficients listed in Table 8:

(37) $\mathcal{L} = \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2 + \lambda_3 \mathcal{L}_3.$
All other hyperparameters, including the initial learning rates, training epochs and loss coefficients, are summarized in Table 7 and Table 8. The learning rates are halved every 10 epochs.
Finetuning on real datasets
We use a similar strategy to finetune on the IIW dataset [10] and the NYU dataset [52]. We take the trained model and finetune each level of the cascade sequentially. The batch size is 4 for the first level of the cascade and 2 for the second level. In each iteration, we send two batches of images to the network, one from our synthetic dataset and the other from the real dataset. The two subnetworks are trained in an end-to-end manner. The loss function for images from our dataset is the same as Eq. (11). The loss function for finetuning on the NYU dataset is a combination of the losses on each component, to which we add a rendering loss comparing the rendered image $\hat{I}$ with the input image $I$:

(38) $\mathcal{L}_{NYU} = \lambda_D \mathcal{L}_D + \lambda_N \mathcal{L}_N + \lambda_{ren} \| \hat{I} - I \|^2.$

When finetuning on the IIW dataset, we include the ordinal reflectance loss defined in [36]. The loss function for images from the IIW dataset is

(39) $\mathcal{L}_{IIW} = \lambda_{ord} \mathcal{L}_{ord} + \lambda_{ren} \mathcal{L}_{ren}.$

When training on the NYU dataset, we also perform data augmentation by randomly flipping, cropping and scaling the input images, with a scale uniformly sampled from 0.8 to 1.2, since the dataset is relatively small.
References
 [1] Adobe Stock 3D assets. https://stock.adobe.com/3dassets.
 [2] NVIDIA OptiX. https://developer.nvidia.com/optix.

 [3] E. H. Adelson and A. P. Pentland. The perception of shading and reflectance. In Perception as Bayesian Inference, pages 409–423. 1996.
 [4] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (Proc. SIGGRAPH), 28(3), Aug. 2009.
 [5] J. Barron and J. Malik. Shape, illumination, and reflectance from shading. PAMI, 37(8):1670–1687, 2013.

 [6] J. T. Barron and J. Malik. Intrinsic scene properties from a single RGB-D image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 17–24, 2013.
 [7] J. T. Barron and B. Poole. The fast bilateral solver. In European Conference on Computer Vision, pages 617–632. Springer, 2016.
 [8] H. G. Barrow and J. M. Tenenbaum. Recovering intrinsic scene characteristics from images. Computer Vision Systems, pages 3–26, 1978.
 [9] R. Basri and D. W. Jacobs. Lambertian reflectance and linear subspaces. PAMI, 25(2), 2003.
 [10] S. Bell, K. Bala, and N. Snavely. Intrinsic images in the wild. ACM Transactions on Graphics (TOG), 33(4):159, 2014.
 [11] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the materials in context database. Computer Vision and Pattern Recognition (CVPR), 2015.
 [12] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In SIGGRAPH, pages 187–194, 1999.
 [13] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGBD data in indoor environments. International Conference on 3D Vision (3DV), 2017.
 [14] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
 [15] C. Che, F. Luan, S. Zhao, K. Bala, and I. Gkioulekas. Inverse transport networks. arXiv preprint arXiv:1809.10820, 2018.
 [16] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1511–1520, 2017.
 [17] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
 [18] V. Deschaintre, M. Aittala, F. Durand, G. Drettakis, and A. Bousseau. Single-image SVBRDF capture with a rendering-aware deep network. ACM Transactions on Graphics (TOG), 37(4):128, 2018.
 [19] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
 [20] M.-A. Gardner, K. Sunkavalli, E. Yumer, X. Shen, E. Gambaretto, C. Gagné, and J.-F. Lalonde. Learning to predict indoor illumination from a single image. ACM Trans. Graphics, 9(4), 2017.
 [21] S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, T. Tuytelaars, and L. V. Gool. What is around the camera? In ICCV, 2017.
 [22] P. Green, J. Kautz, and F. Durand. Efficient reflectance and visibility approximations for environment map rendering. In Computer Graphics Forum, volume 26, pages 495–502. Wiley Online Library, 2007.
 [23] Y. Hold-Geoffroy, A. Athawale, and J.-F. Lalonde. Deep sky modeling for single image outdoor lighting estimation. In CVPR, 2019.
 [24] Y. Hold-Geoffroy, K. Sunkavalli, S. Hadap, E. Gambaretto, and J.-F. Lalonde. Deep outdoor illumination estimation. In CVPR, 2017.
 [25] B. K. P. Horn and M. J. Brooks, editors. Shape from Shading. MIT Press, Cambridge, MA, USA, 1989.
 [26] W. Jakob. Mitsuba renderer, 2010. http://www.mitsubarenderer.org.
 [27] M. K. Johnson and E. H. Adelson. Shape estimation in natural illumination. In CVPR, 2011.
 [28] Y. Kanamori and Y. Endo. Relighting humans: occlusionaware inverse rendering for fullbody human images. SIGGRAPH Asia, 37(270):1–270, 2018.
 [29] B. Karis and E. Games. Real shading in unreal engine 4. Proc. Physically Based Shading Theory Practice, 4, 2013.
 [30] B. Karis and E. Games. Real shading in unreal engine 4. Proc. Physically Based Shading Theory Practice, 2013.
 [31] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
 [32] K. Karsch, V. Hedau, D. Forsyth, and D. Hoiem. Rendering synthetic objects into legacy photographs. ACM Transactions on Graphics, 30(6):1, 2011.
 [33] K. Karsch, K. Sunkavalli, S. Hadap, N. Carr, H. Jin, R. Fonte, M. Sittig, and D. Forsyth. Automatic scene inference for 3d object compositing. ACM Transactions on Graphics, (3):32:1–32:15, 2014.
 [34] V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: image and video synthesis using graph cuts. TOG, 22(3):277–286, 2003.
 [35] T.M. Li, M. Aittala, F. Durand, and J. Lehtinen. Differentiable monte carlo ray tracing through edge sampling. ACM Trans. Graph. (Proc. SIGGRAPH Asia), 37(6):222:1–222:11, 2018.
 [36] Z. Li and N. Snavely. CGIntrinsics: Better intrinsic image decomposition through physically-based rendering. In ECCV, pages 371–387, 2018.
 [37] Z. Li, K. Sunkavalli, and M. Chandraker. Materials for masses: SVBRDF acquisition with a single mobile phone image. In ECCV, pages 72–87, 2018.
 [38] Z. Li, Z. Xu, R. Ramamoorthi, K. Sunkavalli, and M. Chandraker. Learning to reconstruct shape and spatially-varying reflectance from a single image. In SIGGRAPH Asia, page 269. ACM, 2018.
 [39] L. Liang, C. Liu, Y.Q. Xu, B. Guo, and H.Y. Shum. Realtime texture synthesis by patchbased sampling. ACM Transactions on Graphics (ToG), 20(3):127–150, 2001.
 [40] G. Liu, D. Ceylan, E. Yumer, J. Yang, and J.M. Lien. Material editing using a physically based rendering network. 2017.
 [41] S. Lombardi and K. Nishino. Reflectance and natural illumination from a single image. In ECCV, 2012.
 [42] S. R. Marschner and D. P. Greenberg. Inverse lighting for photography. In Color and Imaging Conference, volume 1997, pages 262–265. Society for Imaging Science and Technology, 1997.
 [43] A. Meka, M. Maximov, M. Zollhoefer, A. Chatterjee, H.P. Seidel, C. Richardt, and C. Theobalt. Lime: Live intrinsic material estimation. In CVPR, 2018.
 [44] J. Moritz, S. James, T. S. Haines, T. Ritschel, and T. Weyrich. Texture stationarization: Turning photos into tileable textures. In Computer Graphics Forum, volume 36, pages 177–188. Wiley Online Library, 2017.
 [45] A. Ngan, F. Durand, and W. Matusik. Experimental analysis of brdf models. Rendering Techniques, 2005(16th):2, 2005.
 [46] G. Oxholm and K. Nishino. Shape and reflectance from natural illumination. In ECCV, 2012.
 [47] B. T. Phong. Illumination for computer generated pictures. Communications of the ACM, 18(6):311–317, 1975.
 [48] R. Ramamoorthi and P. Hanrahan. An efficient representation for irradiance environment maps. In SIGGRAPH, 2001.
 [49] F. Romeiro and T. Zickler. Blind reflectometry. In ECCV, 2010.
 [50] A. Shashua and T. RiklinRaviv. The quotient image: Classbased rerendering and recognition with varying illuminations. PAMI, 23(2):129–139, Feb. 2001.
 [51] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In CVPR, 2017.
 [52] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGB-D images. In European Conference on Computer Vision, pages 746–760. Springer, 2012.
 [53] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. Proceedings of 30th IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [54] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1746–1754, 2017.
 [55] T. Sun, H. W. Jensen, and R. Ramamoorthi. Connecting measured brdfs to analytic brdfs by datadriven diffusespecular separation. ACM Transactions on Graphics (TOG), 37(6):273, 2018.

 [56] A. Tewari, M. Zollhofer, H. Kim, P. Garrido, F. Bernard, P. Perez, and C. Theobalt. MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In ICCV, 2018.
 [57] Y.-T. Tsai and Z.-C. Shih. All-frequency precomputed radiance transfer using spherical radial basis functions and clustered tensor approximation. In TOG, volume 25, pages 967–976. ACM, 2006.
 [58] Y. Wu and K. He. Group normalization. In ECCV, pages 3–19, 2018.
 [59] K. Xu, W.L. Sun, Z. Dong, D.Y. Zhao, R.D. Wu, and S.M. Hu. Anisotropic spherical gaussians. ACM Transactions on Graphics (TOG), 32(6):209, 2013.
 [60] Y. Zhang, S. Song, E. Yumer, M. Savva, J.-Y. Lee, H. Jin, and T. Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. CVPR, 2017.