Neural Inverse Rendering of an Indoor Scene from a Single Image

01/08/2019, by Soumyadip Sengupta, et al.

Inverse rendering aims to estimate physical scene attributes (e.g., reflectance, geometry, and lighting) from image(s). As a long-standing, highly ill-posed problem, inverse rendering has been studied primarily for single 3D objects or with methods that solve for only one of the scene attributes. To our knowledge, we are the first to propose a holistic approach for inverse rendering of an indoor scene from a single image with CNNs, which jointly estimates reflectance (albedo and gloss), surface normals and illumination. To address the lack of labeled real-world images, we create a large-scale synthetic dataset, named SUNCG-PBR, with physically-based rendering, which is a significant improvement over prior datasets. For fine-tuning on real images, we perform self-supervised learning using the reconstruction loss, which re-synthesizes the input images from the estimated components. To enable self-supervised learning on real data, our key contribution is the Residual Appearance Renderer (RAR), which can be trained to synthesize complex appearance effects (e.g., inter-reflection, cast shadows, near-field illumination, and realistic shading), which would be neglected otherwise. Experimental results show that our approach outperforms state-of-the-art methods, especially on real images.


1 Introduction

As one of the core problems in computer vision, inverse rendering aims to estimate physical attributes (e.g., geometry, reflectance, and illumination) of a scene from photographs, with wide applications in gaming, AR/VR, and robotics [12, 14, 40, 42]. Inverse rendering is severely under-constrained and ill-posed, especially with a single image as input. While previous methods [1, 24, 30, 39] have shown some promising results by relying on statistical priors on shape, material, and illumination, they mostly focus on single 3D objects. With the recent success of using deep CNNs to learn priors for scene understanding [6, 8, 10, 15, 25], we ask the question: to what extent can deep CNNs help to solve the inverse rendering of a 3D scene from a single image?

In this paper, we take the first step and propose a holistic data-driven approach for inverse rendering of an indoor scene from a single image with CNNs. (The authors contributed to this work while they were at NVIDIA.) Unlike recent works which either estimate only one of the scene attributes [8, 6, 21] or are limited to single objects [15, 23, 25], our approach jointly learns and estimates reflectance (albedo and gloss), surface normals, and illumination of an indoor scene from a single image, as shown in Fig. 1. Inverse rendering of a scene is particularly challenging due to complex appearance effects (e.g., inter-reflection, cast shadows, near-field illumination, and realistic shading).

Two key contributions make our approach possible: 1) we introduce a new synthetic dataset created with physically based rendering, and 2) we learn from unlabeled real data using a self-supervised reconstruction loss, enabled by our proposed Residual Appearance Renderer (RAR), which learns to predict complex appearance effects on real images.

Rendering dataset. It is especially challenging for inverse rendering tasks to obtain accurate ground truth labels for real images. Hence, we create a large-scale synthetic dataset with physically-based rendering, named SUNCG-PBR, covering all the 3D indoor scenes from SUNCG [37]. Compared to the prior work PBRS [45], SUNCG-PBR significantly improves data quality in the following ways: (1) each scene is rendered under multiple natural illuminations; (2) we render the same scene twice, once with all materials set to Lambertian and once with the default material settings, producing (diffuse, specular) image pairs; (3) we utilize deep denoising [4], which allows us to render high-quality images from a limited number of samples per pixel. Our dataset consists of 235,893 images with labels for normals, depth, albedo, Phong [16] model parameters, and semantic segmentation. Examples are shown in Fig. 5. We plan to release the SUNCG-PBR dataset.

Residual Appearance Renderer. Our key idea is to learn from unlabeled real data using a self-supervised reconstruction loss, which is enabled by our proposed Residual Appearance Renderer (RAR) module. While using a reconstruction loss for domain transfer from synthetic to real has been explored previously [32, 36, 23], the renderers in those works are limited to direct illumination under distant lighting with a single material. For real images of a scene, however, such a simple direct illumination renderer cannot synthesize important, complex appearance effects, such as inter-reflections, cast shadows, near-field lighting, and realistic shading. These effects, termed residual appearance in this paper, can otherwise only be simulated with the rendering equation via physically-based ray-tracing, which is non-differentiable and cannot be employed in a learning-based framework. To this end, we propose the Residual Appearance Renderer (RAR) module, which, along with a direct illumination renderer, can reconstruct the original image from the estimated scene attributes for self-supervised learning on real images.

Moreover, similar to prior works, we also incorporate sparse labels on real data (i.e., pair-wise reflectance comparisons [2, 19, 47] and sparse material segmentation [3]) as a form of weak supervision, to further improve performance on real images.

To our knowledge, our approach is the first data-driven solution to single-image inverse rendering of a scene. SIRFS [1], a classical optimization-based method, appears to be the only prior work with similar goals. Compared with SIRFS, as shown in Sec. 5, our method is more robust and accurate. In addition, we also compare with recent DL-based methods that estimate only one of the scene attributes, such as intrinsic images [19, 20, 47, 28], lighting [8], and normals [45]. Experimental results show that our approach outperforms most of these single-attribute methods (especially on real images), which suggests that jointly learning all of these scene attributes improves generalization.

2 Related Work

Optimization-based approaches.

For inverse rendering from a few images, most traditional optimization-based approaches make strong assumptions about statistical priors on illumination and/or reflectance. A variety of sub-problems have been studied, such as intrinsic image decomposition [39], shape from shading [30, 29], and BRDF estimation [24]. Recently, SIRFS [1] showed that it is possible to factorize an image of an object or a scene into surface normals, albedo, and spherical harmonics lighting. In [33], the authors use CNNs to predict an initial depth and then solve inverse rendering with an optimization. From an RGBD video, Zhang et al. [43] proposed an optimization method to obtain the reflectance and illumination of an indoor room. These optimization-based methods, although physically grounded, often do not generalize well to real images where those statistical priors no longer hold.

Learning-based approaches.

With recent advances in deep learning, researchers have proposed to learn data-driven priors to solve some of these inverse problems with CNNs, many of which have achieved promising results. For example, it has been shown that depth and normals can be estimated from a single image [6, 7, 48] or multiple images [38]. Parametric BRDFs may be estimated either from an RGBD sequence of an object [25, 15] or for planar surfaces [21]. Lighting may also be estimated from images, either as an environment map [8, 10], spherical harmonics [46], or point lights [44]. Recent works [36, 32] have also performed inverse rendering on faces. Some recent works jointly learn some of the intrinsic components of an object, such as reflectance and illumination [9, 41], reflectance and shape [22], and normals, BRDF, and distant lighting [34, 23]. Nevertheless, these efforts are mainly limited to objects rather than scenes, and do not model the aforementioned residual appearance effects such as inter-reflection, near-field lighting, and cast shadows present in real images.

Differentiable Renderer.

A few recent works from the graphics community have proposed differentiable Monte Carlo renderers [18, 5] for optimizing rendering parameters (e.g., camera poses, scattering parameters) of synthetic 3D scenes. The neural mesh renderer [13] addressed the problem of differentiable visibility and rasterization. Our proposed RAR is in the same spirit, but its goal is to synthesize the complex appearance effects needed for inverse rendering on real images, which is significantly more challenging.

Figure 2: Overview of our approach. Our Inverse Rendering Network (IRN) consists of two modules, IRN-Diffuse and IRN-Specular, which predict albedo, normals, an illumination map, and a glossiness segmentation, respectively. We train on unlabeled real images using a self-supervised reconstruction loss, which combines a closed-form Direct Renderer with no learnable parameters and the proposed Residual Appearance Renderer (RAR), which learns to predict complex appearance effects.
Datasets for inverse rendering.

High-quality synthetic data is essential for learning-based inverse rendering. SUNCG [37] created a large-scale 3D indoor scene dataset. The images of the SUNCG dataset are not photo-realistic, as they are rendered with OpenGL using diffuse materials and point source lighting. An extension of this dataset, PBRS [45], uses physically based rendering to generate photo-realistic images. However, due to the computational bottleneck in ray-tracing, the rendered images are quite noisy and limited to one lighting condition. There also exist a few real-world datasets with partial labels on geometry, reflectance, or lighting. NYUv2 [27] provides surface normals for indoor scenes. OpenSurfaces [3] provides a partial segmentation of objects with their glossiness properties labeled by humans. Relative reflectance judgments from humans are provided in the IIW dataset [2] and are used in many intrinsic image decomposition methods. In contrast to these works, we create a large-scale synthetic dataset with significantly improved image quality.

Intrinsic image decomposition.

Intrinsic image decomposition is a sub-problem of inverse rendering, where a single image is decomposed into albedo and shading. Recent methods learn intrinsic image decomposition from labeled synthetic data [17, 26, 34] and from unlabeled [20] or partially labeled real data [47, 19, 28, 2]. Intrinsic image decomposition methods do not explicitly recover geometry, illumination or glossiness of the material, but rather combine them together as shading. In contrast, our goal is to perform a complete inverse rendering which has a wider range of applications in AR/VR.

3 Our Approach

We present a deep learning based approach for inverse rendering from a single 8-bit image, as shown in Fig. 2. Specifically, given an input image I, we estimate all its intrinsic components, i.e., reflectance, geometry, and lighting. The reflectance is represented as a diffuse albedo map A plus Phong [16] model parameters for the specularity. We represent the geometry as a normal map N and the lighting as an environment map E. To make per-pixel Phong parameter estimation feasible, we simplify the task as a glossiness segmentation problem and define three glossiness categories (i.e., matte, semi-glossy, and glossy) based on the statistical distributions of BRDFs in our dataset. The output probability maps of the glossiness segmentation network are used as weights to compute the per-pixel Phong parameters. Thus, as shown in Fig. 2, our proposed neural Inverse Rendering Network (IRN) consists of two sub-networks: IRN-Diffuse, denoted f_d, which estimates A, N, and E, and IRN-Specular, denoted f_s, which estimates the glossiness segmentation S:

A, N, E = f_d(I; Θ_d),    (1)
S = f_s(I, A; Θ_s),    (2)

where Θ_d and Θ_s are the network parameters. The estimated per-pixel Phong parameters can then be computed from the soft segmentation mask S.
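As a concrete illustration of how the soft segmentation drives the specular model, the minimal sketch below blends per-class Phong parameters with the predicted class probabilities (the representative per-class values are placeholders, not taken from the paper):

```python
import torch

def phong_from_segmentation(seg_probs, class_params):
    """Blend per-class Phong parameters with the soft glossiness segmentation.

    seg_probs:    (B, 3, H, W) softmax output of IRN-Specular
                  (matte, semi-glossy, glossy).
    class_params: (3, 2) representative (specular coefficient, exponent)
                  for each class -- placeholder values, not from the paper.
    Returns:      (B, 2, H, W) per-pixel Phong parameters.
    """
    return torch.einsum('bchw,cp->bphw', seg_probs, class_params)

# Hypothetical class representatives; the paper derives its values from the
# material statistics of SUNCG-PBR.
class_params = torch.tensor([[0.0, 1.0],    # matte
                             [0.3, 10.0],   # semi-glossy
                             [0.7, 50.0]])  # glossy
```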

Using our synthetic data SUNCG-PBR, we can simply train these two networks (f_d and f_s) with supervised learning, with only one caveat: we need to approximate the "ground truth" environment maps E using a separate network (see Sec. 3.1 for details). To generalize to real images, we use a self-supervised reconstruction loss. Specifically, as shown in Fig. 2, we use two renderers to re-synthesize the input image from the estimations. The direct renderer f_dr is a simple closed-form shading function with no learnable parameters, which synthesizes the direct illumination part of the raytraced image. The Residual Appearance Renderer (RAR), denoted f_rar, is a trainable network module which learns to synthesize the complex appearance effects:

I_dr = f_dr(A, N, E),    (3)
I_res = f_rar(I, N, A; Θ_r).    (4)

The self-supervised reconstruction loss is thus defined as L_recon = || I − (I_dr + I_res) ||. We explain the details of the direct renderer and the RAR in Sec. 3.2.
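To make the self-supervision concrete, here is a minimal sketch of this reconstruction loss, assuming PyTorch tensors and an L1 photometric penalty (the exact norm and any weighting used in the paper are not specified here):

```python
import torch.nn.functional as F

def reconstruction_loss(img, albedo, normals, env_map, direct_renderer, rar):
    """Self-supervised loss: the input image should be explained by the
    closed-form direct rendering plus the learned residual appearance."""
    i_dr = direct_renderer(albedo, normals, env_map)   # no learnable parameters
    i_res = rar(img, normals, albedo)                  # learned residual image
    return F.l1_loss(i_dr + i_res, img)
```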

In summary, four sets of weights need to be learned during training: Θ_d, Θ_s, the RAR weights Θ_r, and the weights of the lighting estimation network used to approximate the ground truth environment maps. At inference time, given an input image I, we run IRN-Diffuse and IRN-Specular to estimate all the intrinsic components for the scene.

3.1 Training on Synthetic Data

We first train IRN-Diffuse and IRN-Specular on our synthetic dataset SUNCG-PBR with supervised learning. As shown in Fig. 2, IRN-Diffuse has a structure similar to [32]: a convolutional encoder, followed by nine residual blocks and a convolutional decoder, for estimating albedo and normals. We condition the lighting estimation block on the image, normal, and albedo features. IRN-Specular takes the image and the albedo predicted by IRN-Diffuse as input and produces a three-class glossiness segmentation map (matte, semi-glossy, and glossy). IRN-Specular is based on U-Net [31], which has been shown to be effective for simple segmentation problems. Details of the IRN architecture are provided in Section 8.1.

We use ground truth albedo and normals from SUNCG-PBR for supervised learning. The ground truth glossiness segmentation mask is obtained by performing a per-pixel classification based on the ground truth Phong parameters as follows:

matte: the specular parameters fall below low thresholds,
glossy: the specular parameters exceed high thresholds,    (5)
semi-glossy: otherwise,

where the thresholds are chosen based on the statistical distributions of the Phong parameters in the SUNCG-PBR dataset.
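For illustration, a sketch of such a per-pixel classification is shown below; the threshold values are hypothetical placeholders, since the actual thresholds are derived from SUNCG-PBR statistics:

```python
import numpy as np

def glossiness_labels(k_s, alpha, t_ks=(0.05, 0.4), t_alpha=(5.0, 30.0)):
    """Per-pixel glossiness class from ground truth Phong parameters.

    k_s, alpha: (H, W) specular coefficient and exponent maps.
    t_ks, t_alpha: (low, high) thresholds -- placeholders, not the values
    used in the paper.
    Returns (H, W) integer labels: 0 = matte, 1 = semi-glossy, 2 = glossy.
    """
    labels = np.ones_like(k_s, dtype=np.int64)           # semi-glossy by default
    labels[(k_s < t_ks[0]) & (alpha < t_alpha[0])] = 0   # matte
    labels[(k_s > t_ks[1]) & (alpha > t_alpha[1])] = 2   # glossy
    return labels
```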

The ground truth environmental lighting E is challenging to obtain, as it is a first-order approximation of the actual surface light field. We use environment maps as the exterior lighting for rendering SUNCG-PBR, but these environment maps cannot directly serve as the ground truth E, because the virtual cameras are placed inside each indoor scene. Due to occlusions, only a small fraction of the exterior lighting (e.g., through windows and open doors) is directly visible. The surface light field of each scene is mainly attributed to global illumination (i.e., inter-reflection) and some interior lighting. One could approximate E by minimizing the difference between the raytraced image and the output of the direct renderer f_dr applied to the ground truth albedo and normals. However, we found this approximation to be inaccurate, since f_dr cannot model the residual appearance present in the raytraced image.

We thus resort to a learning-based method to approximate the ground truth lighting E. Specifically, we train a residual-block based network to predict E from the input image, the normals N, and the albedo A. We first train this network on images synthesized by the direct renderer with ground truth normals, albedo, and indoor lighting, where the indoor lighting is randomly sampled from a set of real indoor environment maps. Here the network learns a prior over the distribution of indoor lighting. Next, we fine-tune this network on the raytraced images by minimizing the reconstruction loss between the raytraced image and the direct rendering under the predicted lighting. We thus obtain an approximate ground truth of the environmental lighting, namely the lighting that best reconstructs the raytraced image under the direct rendering model.
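A compact sketch of this two-stage procedure is given below, assuming a PyTorch module `light_net` and the direct renderer of Sec. 3.2; the loss choices and optimizer settings are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def train_light_net(light_net, direct_renderer, synth_batches, raytraced_batches, lr=1e-4):
    opt = torch.optim.Adam(light_net.parameters(), lr=lr)

    # Stage 1: supervised pre-training on direct-renderer images with known
    # indoor environment maps, so the network learns an indoor-lighting prior.
    for albedo, normals, env_gt in synth_batches:
        img = direct_renderer(albedo, normals, env_gt)      # synthetic input
        loss = F.l1_loss(light_net(img, normals, albedo), env_gt)
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: fine-tune on raytraced images by minimizing the reconstruction
    # error of the direct rendering under the predicted lighting; the final
    # predictions serve as pseudo ground-truth lighting for IRN-Diffuse.
    for img_rt, albedo, normals in raytraced_batches:
        env_pred = light_net(img_rt, normals, albedo)
        loss = F.l1_loss(direct_renderer(albedo, normals, env_pred), img_rt)
        opt.zero_grad(); loss.backward(); opt.step()
```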

Finally, with all the ground truth components ready, the supervised loss for training IRN-Diffuse is

L_IRN-Diffuse = L_A + L_N + L_E,    (6)

where L_A, L_N, and L_E are L1 losses on the predicted albedo, normals, and environment map, respectively. We use a cross-entropy loss over the predicted and ground truth glossiness segmentation for training IRN-Specular.
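A sketch of these supervised losses follows; equal weighting of the three terms is an assumption for illustration:

```python
import torch.nn.functional as F

def irn_diffuse_loss(pred, gt):
    """Supervised L1 losses on albedo, normals, and lighting (cf. Eq. 6)."""
    return (F.l1_loss(pred['albedo'], gt['albedo'])
            + F.l1_loss(pred['normals'], gt['normals'])
            + F.l1_loss(pred['env_map'], gt['env_map']))

def irn_specular_loss(seg_logits, seg_labels):
    """Three-class cross-entropy over the glossiness segmentation."""
    return F.cross_entropy(seg_logits, seg_labels)
```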

3.2 Training on Real Images with Self-supervision

Learning from synthetic data alone is not sufficient to perform well on real images. Although SUNCG-PBR was created with physically-based rendering, the variation in objects, materials, and illumination is still limited compared to that of real images. Since obtaining ground truth labels for inverse rendering is almost impossible for real images (especially for reflectance and illumination), we use two key ideas for domain transfer from synthetic to real: (1) a self-supervised reconstruction loss and (2) weak supervision from sparse labels.

Previous works on faces [32, 36] and objects [23] have shown success in using a self-supervised reconstruction loss to learn from unlabeled real images. As mentioned earlier, the reconstruction in these prior works is limited to the direct renderer f_dr, a simple closed-form shading function (under distant lighting) with no learnable parameters. In this paper, we implement f_dr simply as

I_dr(p) = A(p) Σ_q E(q) max(0, N(p) · l_q),    (7)

where l_q corresponds to the direction of pixel q on the environment map E. While using f_dr to compute the reconstruction loss may work well for faces [32] or small objects with homogeneous material [23], we found that it fails for inverse rendering of a scene. In order to synthesize the aforementioned residual appearances (e.g., inter-reflection, cast shadows, near-field lighting), we propose to use the differentiable Residual Appearance Renderer (RAR), f_rar, which learns to predict a residual image I_res. The self-supervised reconstruction loss is thus defined as L_recon = || I − (I_dr + I_res) ||.
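The sketch below shows one way to implement such a closed-form Lambertian shading over a discretized environment map (the construction of the per-pixel direction grid and the normalization are assumptions; the paper's exact discretization may differ):

```python
import torch

def direct_render(albedo, normals, env_map, directions):
    """Closed-form Lambertian shading under a discretized environment map.

    albedo, normals: (B, 3, H, W); normals assumed to be unit length.
    env_map:         (B, 3, He, We) environment map (e.g., 18 x 36).
    directions:      (He*We, 3) unit direction of each environment-map pixel.
    """
    B, _, H, W = normals.shape
    n = normals.permute(0, 2, 3, 1).reshape(B, -1, 3)        # (B, HW, 3)
    cos = torch.clamp(n @ directions.t(), min=0.0)           # (B, HW, Q)
    light = env_map.reshape(B, 3, -1).permute(0, 2, 1)       # (B, Q, 3)
    shading = (cos @ light).permute(0, 2, 1).reshape(B, 3, H, W)
    return albedo * shading / directions.shape[0]            # average over samples
```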

Figure 3: RAR learns to predict complex appearance effects (e.g., near-field lighting, cast shadows, inter-reflections) which cannot be modeled by a direct renderer (DR).

Our goal is to train RAR to capture only the residual appearances, not to correct artifacts in the directly rendered image caused by faulty normal, albedo, and lighting estimates from the IRN. To achieve this, we train RAR only on synthetic data with ground-truth normals and albedo, and keep it fixed while training on real data, so that it only learns to predict the residual appearances correctly when the direct renderer reconstruction is accurate.

As shown in Fig. 2, RAR consists of a U-Net [31], with normals and albedo as its input, and latent image features (300-dimensional) learned by a convolutional encoder (‘Enc’). We concatenate the image features at the end of the U-Net encoder and process them with the U-Net decoder to produce the residual image. As shown in Fig. 3, RAR indeed learns to synthesize the complex residual appearance effects present in the original input image.

Similar to prior work [47, 19], we use sparse labels over reflectance as weak supervision during training on real images. Specifically, we use pair-wise relative reflectance judgments from the Intrinsic Images in the Wild (IIW) dataset [2] as a form of supervision over the albedo. For glossiness segmentation we use sparse human annotations from the OpenSurfaces dataset [3] as weak labels. More details are provided in Section 8.2. As shown later in Sec. 6, using such weak supervision substantially improves performance on real images.

3.3 Training Procedure

We summarize the different stages of training from synthetic to real data. More details are given in Section 8.2.

Estimate GT indoor lighting: (a) First train the lighting estimation network on images rendered by the direct renderer. (b) Fine-tune it on raytraced synthetic images to estimate the GT indoor environment maps.

Train on synthetic images: (a) Train IRN-Diffuse with supervised L1 loss on albedo, normal and lighting. (b) Train RAR. (c) Train IRN-Specular with the input image and the albedo predicted by IRN-Diffuse.

Train on real images: Fine-tune IRN-Diffuse and IRN-Specular on real data with (1) the pseudo-supervision over albedo, normal and lighting (to handle ambiguity of decomposition as proposed in [32]), (2) the self-supervised reconstruction loss with RAR, and (3) the weak supervision over the albedo (i.e., pair-wise relative reflectance judgment) and the sparse glossiness segmentation.

4 The SUNCG-PBR Dataset

Figure 4: Comparison with PBRS [45] and SUNCG [37]. Our dataset provides more photo-realistic and less noisy images with specular highlights under multiple lighting conditions.
Figure 5: Our SUNCG-PBR Dataset. We provide 235,893 images of a scene assuming specular and diffuse reflectance along with ground truth depth, surface normals, albedo, Phong model parameters, semantic segmentation and glossiness segmentation.

High-quality synthetic datasets are essential for learning-based inverse rendering. The SUNCG dataset [37] contains 45,622 indoor scenes with 2644 unique objects, but their images are rendered with OpenGL under fixed point light sources. The PBRS dataset [45] extends the SUNCG dataset by using physically based rendering with Mitsuba [11]. Yet, due to a limited computational budget, many rendered images in PBRS are quite noisy. Moreover, the images in PBRS are rendered with only diffuse materials and a single outdoor environment map, which also significantly limits the photo-realism of the rendered images. High-quality photo-realistic images are necessary for training RAR to capture residual appearances.

In this paper, we introduce a new dataset named SUNCG-PBR, which improves data quality in the following ways: (1) the rendering is performed under multiple outdoor environment maps; (2) we render the same scene twice, once with all materials set to Lambertian and once with the default material settings, which yields (diffuse, specular) image pairs that can be useful to the community for learning to remove highlights and for many other applications; (3) we utilize deep denoising [4], which allows us to raytrace high-quality images from a limited number of samples per pixel. Our dataset consists of 235,893 images with labels for normals, depth, albedo, Phong [16] model parameters, and semantic and glossiness segmentation. Examples are shown in Fig. 5. A comparison with the SUNCG and PBRS datasets is shown in Fig. 4.

5 Experimental Results

Comparison with SIRFS.

SIRFS [1] is an optimization-based method for inverse rendering, which estimates surface normals, albedo and spherical harmonics lighting from a single image. It is an inspiring work, as it shows the power of using statistical priors (over lighting, reflectance, and geometry) for inverse rendering from a single image. We compare with SIRFS on the test data from the IIW dataset [2]. As shown in Fig. 6, our method produces more accurate normals and better disambiguation of reflectance from shading. This is expected, as we are using deep CNNs, which are known to better learn and utilize statistical priors present in the data than traditional optimization techniques.

Figure 6: Comparison with SIRFS [1]. Using deep CNNs our method performs better disambiguation of reflectance from shading and predicts better surface normals.
Figure 7: Comparison with intrinsic image algorithms. Our method seems to preserve more detailed texture and has fewer artifacts in the predicted albedo, compared to the prior works.
Figure 8: Our Result. We show the estimated intrinsic components; normals, albedo, glossiness segmentation (matte-blue, glossy-red and semi-glossy-green) and lighting predicted by the network, along with the reconstructed image with our direct renderer and the RAR.
Comparison with intrinsic image decomposition algorithms.

Intrinsic image decomposition aims to decompose an image into albedo and shading, which is a sub-problem of inverse rendering. Several recent works [2, 47, 28, 19] have shown promising results with deep learning. While our goal is to solve the complete inverse rendering problem, we still compare our albedo prediction with these latest intrinsic image decomposition methods. We evaluate the WHDR (Weighted Human Disagreement Rate) metric [2] on the test set of the IIW dataset [2] and report the results in Table 1. As shown, we outperform the algorithms that train on the original IIW dataset [2]. Since our goal is not intrinsic image decomposition, we do not train on additional intrinsic-image-specific datasets and avoid any post-processing as done in CGIntrinsics [19]. We also present a qualitative comparison of the inferred albedo with existing algorithms in Figure 7 and with Li et al. [19] in Figure 9. As shown, our method seems to preserve more detailed texture and has fewer artifacts in the predicted albedo compared to prior work.

Figure 9: Comparison with CGI (Li et al. [19]). Compared with CGI [19], our method performs better disambiguation of reflectance from shading and preserves the texture in the albedo.

Algorithm              Training set    WHDR
Bell et al. [2]        -               20.6%
Li et al. [20]         -               20.3%
Zhou et al. [47]       IIW             19.9%
Nestmeyer et al. [28]  IIW             19.5%
Li et al. [19]         IIW             17.5%
Ours                   IIW             16.7%

Table 1: Intrinsic image decomposition on the IIW test set [2]
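For reference, the WHDR metric used in Table 1 can be computed roughly as in the sketch below; the relative-difference threshold is the value commonly used with IIW and should be checked against the benchmark code:

```python
def whdr(albedo, judgments, delta=0.10):
    """Weighted Human Disagreement Rate over IIW-style judgments (a sketch).

    albedo:    (H, W) predicted reflectance intensity.
    judgments: iterable of (y1, x1, y2, x2, label, weight), where label is
               'E' (similar), '1' (point 1 darker), or '2' (point 2 darker).
    """
    err = total = 0.0
    for y1, x1, y2, x2, label, w in judgments:
        r1, r2 = float(albedo[y1, x1]), float(albedo[y2, x2])
        if r2 / max(r1, 1e-10) > 1.0 + delta:
            pred = '1'            # point 1 is darker
        elif r1 / max(r2, 1e-10) > 1.0 + delta:
            pred = '2'            # point 2 is darker
        else:
            pred = 'E'            # roughly equal reflectance
        total += w
        err += w * (pred != label)
    return err / max(total, 1e-10)
```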
Evaluation of lighting estimation.

We estimate an environment map of low spatial resolution from an image. Although this is not the best representation of illumination, it can still capture the significant effects of illumination and can be inferred jointly with the other components. In Figure 10, we present a qualitative evaluation of lighting estimation by inserting a diffuse hemisphere into the scene and rendering it with the light inferred from the image. We compare with our implementation of the method proposed by Gardner et al. [8], which also estimates an environment map from a single indoor image. Recall that our lighting estimation network predicts the environment map given the image, normals, and albedo. ‘GT+’ estimates the environment map given the image and the ground-truth normals and albedo, and thus serves as an achievable upper bound on the quality of the estimated lighting. ‘Ours’ estimates the environment map from an image with IRN. ‘Ours+’ predicts the environment map by feeding the albedo and normals inferred by IRN to the lighting estimation network. Both ‘Ours’ and ‘Ours+’ outperform Gardner et al. [8], as they seem to produce more realistic environment maps. ‘Ours+’ improves lighting estimation over ‘Ours’ by making greater use of the predicted albedo and normals.

Figure 10: Evaluation of lighting estimation. We compare with our implementation of Gardner et al. [8]. ‘GT+’ predicts lighting conditioned on the ground-truth normals and albedo. ‘Ours+’ predicts the environment map by conditioning on the albedo and normals inferred by IRN.
Algorithm     NYUv2 (mean; median)    7-scenes (mean; median)
PBRS [45]     21.85°; 15.33°          38.34°; 25.65°
Ours          23.89°; 16.92°          37.75°; 24.54°
Table 2: Mean and median angular errors for surface normals
Evaluation of normal estimation.

We also compare with PBRS [45], which predicts only surface normals from an image. Both PBRS and our model are trained on NYUv2 [27] and tested on both the NYUv2 and 7-scenes [35] datasets. As shown in Table 2, PBRS outperforms our method by about 2 degrees on the NYUv2 dataset and is comparable to ours on the 7-scenes dataset. This shows that our joint decomposition network IRN-Diffuse generalizes well across datasets and performs comparably to the state-of-the-art normal prediction method PBRS.
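For completeness, the angular-error statistics reported in Table 2 can be computed as in this sketch (the mask handling for pixels without valid ground truth is an assumption):

```python
import numpy as np

def normal_angular_errors(pred, gt, valid_mask):
    """Mean and median angular error (in degrees) between unit normal maps.

    pred, gt:   (H, W, 3) unit normal maps.
    valid_mask: (H, W) boolean mask of pixels with valid ground truth.
    """
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    ang = np.degrees(np.arccos(cos))[valid_mask]
    return ang.mean(), np.median(ang)
```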

Our results.

Figure 8 shows two examples of our results, with the albedo, glossiness segmentation, normal and lighting predicted by the network, as well as the reconstructed image with the direct renderer and the proposed Residual Appearance Renderer (RAR).

6 Ablation Study

Figure 11: Role of RAR in self-supervised training. We train IRN on real data with and without RAR using self-supervision, and show the predicted albedo in columns 2 and 3. The albedo predicted when training ‘without RAR’ fails to remove complex appearance effects such as highlights, cast shadows, and near-field lighting.
Role of the RAR in self-supervised training.

We have argued that the RAR plays an important role in self-supervised training on unlabeled real images, because it captures complex appearance effects that cannot be modeled by a direct renderer. As shown in Figure 11, when trained without the RAR, the network fails to separate such appearance effects from the estimated albedo, attributing these factors to the albedo in order to minimize the reconstruction loss. When the RAR is used during training, the network correctly removes such appearance effects from the albedo. We trained our network with and without the RAR on the IIW dataset [2] (without any weak supervision), and observed that using the RAR improves the quality of the albedo, reducing the WHDR metric from 37.4% to 32.8%. Thus, we conclude both qualitatively and quantitatively that the RAR improves inverse rendering with self-supervised training on real data.

Figure 12: Role of weak supervision. We predict more consistent albedo across large objects like walls, floors and ceilings using pair-wise relative reflectance judgments from the IIW dataset [2].
Role of weak supervision.

Our inverse rendering framework allows us to use weak supervision over intrinsic components whenever available. We train our method on the IIW dataset [2], which contains sparse pair-wise relative reflectance judgments from humans. Training with this weak supervision significantly improves albedo estimation, making it more consistent across large objects like walls, floors, and ceilings as shown in Figure 12.

Role of albedo in glossiness segmentation.

We also show that conditioning IRN-Specular on the albedo predicted by IRN-Diffuse significantly improves glossiness segmentation. We train IRN-Specular with and without the albedo as input on our synthetic SUNCG-PBR dataset. Conditioning on the predicted albedo improves glossiness segmentation on real data, reducing the cross-entropy loss from 0.76 to 0.62, which shows the benefits of joint multi-task learning in inverse rendering.

7 Conclusion and Discussion

We present a holistic deep learning based approach for inverse rendering of an indoor scene from a single RGB image. Experimental results show that our method achieves better performance than prior work for the estimation of albedo and lighting, and comparable performance for the estimation of normals, which demonstrates the effectiveness of joint learning. We create a large-scale, high-quality synthetic dataset, SUNCG-PBR, with physically-based rendering. We also propose a novel Residual Appearance Renderer (RAR), a differentiable network module that can synthesize complex appearance effects such as inter-reflection, cast shadows, near-field illumination, and realistic shading. We show that this renderer is important for employing the self-supervised reconstruction loss to solve inverse rendering on real images. This paper lays the groundwork for future studies of inverse rendering from a single image, photo collections, or videos.

References

  • [1] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015.
  • [2] S. Bell, K. Bala, and N. Snavely. Intrinsic images in the wild. ACM Transactions on Graphics (TOG), 33(4):159, 2014.
  • [3] S. Bell, P. Upchurch, N. Snavely, and K. Bala. OpenSurfaces: A richly annotated catalog of surface appearance. ACM Transactions on Graphics (SIGGRAPH), 32(4), 2013.
  • [4] C. R. A. Chaitanya, A. S. Kaplanyan, C. Schied, M. Salvi, A. Lefohn, D. Nowrouzezahrai, and T. Aila. Interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder. ACM Transactions on Graphics (TOG), 36(4):98, 2017.
  • [5] C. Che, F. Luan, S. Zhao, K. Bala, and I. Gkioulekas. Inverse transport networks. arXiv preprint arXiv:1809.10820, 2018.
  • [6] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In International Conference on Computer Vision (ICCV), pages 2650–2658, 2015.
  • [7] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao. Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [8] M.-A. Gardner, K. Sunkavalli, E. Yumer, X. Shen, E. Gambaretto, C. Gagné, and J.-F. Lalonde. Learning to predict indoor illumination from a single image. ACM Transactions on Graphics (TOG), 36(6):176, 2017.
  • [9] S. Georgoulis, K. Rematas, T. Ritschel, M. Fritz, L. Van Gool, and T. Tuytelaars. DeLight-Net: Decomposing reflectance maps into specular materials and natural illumination. arXiv preprint arXiv:1603.08240, 2016.
  • [10] Y. Hold-Geoffroy, K. Sunkavalli, S. Hadap, E. Gambaretto, and J.-F. Lalonde. Deep outdoor illumination estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2017.
  • [11] W. Jakob. Mitsuba renderer, 2010. http://www.mitsuba-renderer.org.
  • [12] K. Karsch, V. Hedau, D. Forsyth, and D. Hoiem. Rendering synthetic objects into legacy photographs. In ACM Transactions on Graphics (SIGGRAPH Asia), pages 157:1–157:12, 2011.
  • [13] H. Kato, Y. Ushiku, and T. Harada. Neural 3D mesh renderer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3907–3916, 2018.
  • [14] E. A. Khan, E. Reinhard, R. W. Fleming, and H. H. Bülthoff. Image-based material editing. ACM Transactions on Graphics (TOG), 25(3):654–663, 2006.
  • [15] K. Kim, J. Gu, S. Tyree, P. Molchanov, M. Nießner, and J. Kautz. A lightweight approach for on-the-fly reflectance estimation. In International Conference on Computer Vision (ICCV), pages 20–28, 2017.
  • [16] E. P. Lafortune and Y. D. Willems. Using the modified Phong reflectance model for physically based rendering. 1994.
  • [17] L. Lettry, K. Vanhoey, and L. Van Gool. DARN: a deep adversarial residual network for intrinsic image decomposition. In IEEE Workshop on Applications of Computer Vision (WACV), 2018.
  • [18] T.-M. Li, M. Aittala, F. Durand, and J. Lehtinen. Differentiable Monte Carlo ray tracing through edge sampling. ACM Transactions on Graphics (TOG), 37(6):222:1–222:11, 2018.
  • [19] Z. Li and N. Snavely. CGIntrinsics: Better intrinsic image decomposition through physically-based rendering. European Conference on Computer Vision (ECCV), 2018.
  • [20] Z. Li and N. Snavely. Learning intrinsic image decomposition from watching the world. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [21] Z. Li, K. Sunkavalli, and M. Chandraker. Materials for masses: SVBRDF acquisition with a single mobile phone image. In European Conference on Computer Vision (ECCV), 2018.
  • [22] Z. Li, Z. Xu, R. Ramamoorthi, K. Sunkavalli, and M. Chandraker. Learning to reconstruct shape and spatially-varying reflectance with a single image. In ACM Transactions on Graphics (SIGGRAPH Asia), 2018.
  • [23] G. Liu, D. Ceylan, E. Yumer, J. Yang, and J.-M. Lien. Material editing using a physically based rendering network. In International Conference on Computer Vision (ICCV), 2017.
  • [24] S. Lombardi and K. Nishino. Reflectance and illumination recovery in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(1):129–141, 2016.
  • [25] A. Meka, M. Maximov, M. Zollhoefer, A. Chatterjee, H.-P. Seidel, C. Richardt, and C. Theobalt. LIME: Live intrinsic material estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [26] T. Narihira, M. Maire, and S. X. Yu. Direct intrinsics: Learning albedo-shading decomposition by convolutional regression. In International Conference on Computer Vision (ICCV), pages 2992–2992, 2015.
  • [27] P. K. Nathan Silberman, Derek Hoiem and R. Fergus. Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision (ECCV), 2012.
  • [28] T. Nestmeyer and P. V. Gehler. Reflectance adaptive filtering improves intrinsic image estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), page 4, 2017.
  • [29] G. Oxholm and K. Nishino. Shape and reflectance from natural illumination. In European Conference on Computer Vision (ECCV), pages 528–541. Springer, 2012.
  • [30] E. Prados and O. Faugeras. Shape from shading. In Handbook of mathematical models in computer vision, pages 375–388. Springer, 2006.
  • [31] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (MICCAI), pages 234–241. Springer, 2015.
  • [32] S. Sengupta, A. Kanazawa, C. D. Castillo, and D. W. Jacobs. SfSNet: Learning shape, reflectance and illuminance of faces in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [33] E. Shelhamer, J. T. Barron, and T. Darrell. Scene intrinsics and depth from a single image. In International Conference on Computer Vision, Workshops (ICCV-W), pages 37–44, 2015.
  • [34] J. Shi, Y. Dong, H. Su, and X. Y. Stella. Learning non-lambertian object intrinsics across shapenet categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5844–5853, 2017.
  • [35] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2930–2937, 2013.
  • [36] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5444–5453, 2017.
  • [37] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic scene completion from a single depth image. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [38] T. Taniai and T. Maehara. Neural inverse rendering for general reflectance photometric stereo. In International Conference on Machine Learning (ICML), pages 20–28, 2017.
  • [39] M. F. Tappen, W. T. Freeman, and E. H. Adelson. Recovering intrinsic images from a single image. In Advances in Neural Information Processing Systems (NIPS), pages 1367–1374, 2003.
  • [40] B. Tunwattanapong and P. Debevec. Interactive image-based relighting with spatially-varying lights. In ACM Transactions on Graphics (SIGGRAPH), 2009.
  • [41] T. Wang, T. Ritschel, and N. Mitra. Joint material and illumination estimation from photo sets in the wild. In International Conference on 3D Vision (3DV), pages 22–31, 2018.
  • [42] Z. Xu, K. Sunkavalli, S. Hadap, and R. Ramamoorthi. Deep image-based relighting from optimal sparse samples. ACM Transactions on Graphics (TOG), 37(4):126, 2018.
  • [43] E. Zhang, M. F. Cohen, and B. Curless. Emptying, refurnishing, and relighting indoor spaces. ACM Transactions on Graphics (TOG), 35(6):174, 2016.
  • [44] E. Zhang, M. F. Cohen, and B. Curless. Discovering point lights with intensity distance fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6635–6643, 2018.
  • [45] Y. Zhang, S. Song, E. Yumer, M. Savva, J.-Y. Lee, H. Jin, and T. Funkhouser. Physically-based rendering for indoor scene understanding using convolutional neural networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [46] H. Zhou, J. Sun, Y. Yacoob, and D. W. Jacobs. Label denoising adversarial network (LDAN) for inverse lighting of face images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [47] T. Zhou, P. Krahenbuhl, and A. A. Efros. Learning data-driven reflectance priors for intrinsic image decomposition. In International Conference on Computer Vision (ICCV), pages 3469–3477, 2015.
  • [48] W. Zhuo, M. Salzmann, X. He, and M. Liu. Indoor scene structure analysis for single image depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 614–622, 2015.

8 Appendix

In this appendix we provide the details of our network architecture and the loss functions, along with more qualitative evaluations. Specifically, in Section 8.1 we discuss the details of the IRN and RAR network architectures for reproducibility. Details of our training loss functions on real data are provided in Section 8.2. In Section 8.4 we present additional qualitative evaluations.

8.1 Network Architectures

Figure 13: Our Proposed Architecture.

Our proposed Inverse Rendering Network (IRN), shown again in Figure 13 for reference, consists of two modules, IRN-Diffuse and IRN-Specular. IRN is trained on real data using the Residual Appearance Renderer (RAR), which learns to capture complex appearance effects (e.g., inter-reflection, cast shadows, near-field illumination, and realistic shading). Next, we describe each module: IRN-Diffuse, IRN-Specular, and RAR.

8.1.1 IRN-Diffuse

Figure 14: IRN-Diffuse.

In Figure 14 we present the network architecture of IRN-Diffuse. The input to IRN-Diffuse is an image of fixed spatial resolution, and the outputs are albedo and normal maps of the same spatial resolution as the input, along with an 18 × 36 environment map. We provide the details of each block of IRN-Diffuse below.

‘Enc’: C64(k7) - C*128(k3) - C*256(k3)
‘CN(kS)’ denotes a convolution layer with N filters of kernel size S and stride 1, followed by Batch Normalization and ReLU. ‘C*N(kS)’ denotes a convolution layer with N filters of kernel size S and stride 2, followed by Batch Normalization and ReLU. The ‘Enc’ block produces a feature map at 1/4 of the input spatial resolution.

‘Normal ResBLKs’: 9 ResBLKs
This consists of 9 residual blocks (‘ResBLK’s), which operate at 1/4 of the input spatial resolution. Each ‘ResBLK’ consists of Conv256(k3) - BN - ReLU - Conv256(k3) - BN, where ‘ConvN(kS)’ denotes a convolution layer with N filters of kernel size S and stride 1, and ‘BN’ denotes Batch Normalization. (A code sketch of this block is given at the end of this subsection.)

‘Albedo ResBLKs’: Same as ‘Normal ResBLKs’ (weights are not shared).

‘Dec.’: CD*128(k3) - CD*64(k3) - Co3(k7)
‘CD*N(kS)’ denotes a transposed convolution layer with N filters of kernel size S and stride 2, followed by Batch Normalization and ReLU. ‘CN(kS)’ denotes a convolution layer with N filters of kernel size S and stride 1, followed by Batch Normalization and ReLU. The last layer, Co3(k7), is a convolution layer with 3 filters followed by a Tanh layer.

‘Light Est.’: This block first concatenates the outputs of ‘Enc’, ‘Normal ResBLKs’ and ‘Albedo ResBLKs’ into a single feature map at 1/4 of the input spatial resolution, which is then processed by the following module:
C256(k1) - C*256(k3) - C*128(k3) - C*3(k3) - BU(18,36)
‘CN(kS)’ denotes a convolution layer with N filters of kernel size S and stride 1, followed by Batch Normalization and ReLU. ‘C*N(kS)’ denotes a convolution layer with N filters of kernel size S and stride 2, followed by Batch Normalization and ReLU. BU(18,36) bilinearly upsamples the response to produce an 18 × 36 environment map.
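As referenced in the ‘Normal ResBLKs’ description above, a minimal PyTorch sketch of a single ‘ResBLK’ is shown below (padding is chosen to preserve spatial size, which is an assumption):

```python
import torch.nn as nn

class ResBLK(nn.Module):
    """Conv256(k3) - BN - ReLU - Conv256(k3) - BN, with a skip connection."""
    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)
```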

8.1.2 IRN-Specular

IRN-Specular consists of a U-Net architecture that takes the image and the albedo predicted by IRN-Diffuse as input, with skip connections between the encoder and the decoder.
‘Encoder’: C64(k3) - C64(k1) - C*64(k3) - C64(k1) - C*128(k3) - C128(k1) - C*256(k3) - C256(k1) - C*512(k3)
‘Decoder’: CU512(k3) - CU256(k3) - CU128(k3) - CU64(k3) - Co3(k1)
‘CN(kS)’ denotes a convolution layer with N filters of kernel size S and stride 1, followed by Batch Normalization and ReLU. ‘C*N(kS)’ denotes a convolution layer with N filters of kernel size S and stride 2, followed by Batch Normalization and ReLU. ‘CUN(kS)’ denotes a bilinear up-sampling layer followed by a convolution layer with N filters of kernel size S and stride 1, Batch Normalization, and ReLU. ‘Co3(k1)’ is a convolution layer with 3 filters that produces the three-class glossiness segmentation output. Skip connections exist between the ‘C*N(k3)’ layers of the encoder and the ‘CUN(k3)’ layers of the decoder.

8.1.3 RAR

As shown in Figure 13, the Residual Appearance Renderer (RAR) consists of a U-Net architecture and a convolutional encoder. The U-Net, which takes the normals and albedo as input, has the following architecture:
‘Encoder’: C64(k3) - C*64(k3) - C*128(k3) - C*256(k3) - C*512(k3)
‘Decoder’: CU512(k3) - CU256(k3) - CU128(k3) - CU64(k3) - Co3(k1)
‘CN(kS)’ denotes a convolution layer with N filters of kernel size S and stride 1, followed by Batch Normalization and ReLU. ‘C*N(kS)’ denotes a convolution layer with N filters of kernel size S and stride 2, followed by Batch Normalization and ReLU. ‘CUN(kS)’ denotes a bilinear up-sampling layer followed by a convolution layer with N filters of kernel size S and stride 1, Batch Normalization, and ReLU. ‘Co3(k1)’ is a convolution layer with 3 filters that produces the residual image. Skip connections exist between the ‘C*N(k3)’ layers of the encoder and the ‘CUN(k3)’ layers of the decoder. The encoder ‘Enc’ that maps image features to a latent 300-dimensional subspace is:
‘Enc’: C64(k7) - C*128(k3) - C*256(k3) - C128(k1) - C64(k3) - C*32(k3) - C*16(k3) - MLP(300)
MLP(300) takes the response of the previous layer and outputs a 300-dimensional feature, which is concatenated with the output of the last layer of the U-Net ‘Encoder’.
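A high-level PyTorch sketch of the RAR forward pass is shown below; `unet_encoder`, `unet_decoder`, and `image_encoder` stand for the blocks described above, skip connections are omitted for brevity, and spatially broadcasting the 300-d image code before concatenation is an assumption about how the fusion is implemented:

```python
import torch
import torch.nn as nn

class RAR(nn.Module):
    """U-Net over (normals, albedo) whose bottleneck is fused with a 300-d
    latent code of the input image, producing the residual image."""
    def __init__(self, unet_encoder, unet_decoder, image_encoder):
        super().__init__()
        self.unet_encoder = unet_encoder    # (B, 6, H, W) -> (B, 512, h, w)
        self.unet_decoder = unet_decoder    # (B, 512 + 300, h, w) -> (B, 3, H, W)
        self.image_encoder = image_encoder  # (B, 3, H, W) -> (B, 300)

    def forward(self, image, normals, albedo):
        feats = self.unet_encoder(torch.cat([normals, albedo], dim=1))
        code = self.image_encoder(image)                       # (B, 300)
        code = code[:, :, None, None].expand(-1, -1, *feats.shape[2:])
        return self.unet_decoder(torch.cat([feats, code], dim=1))
```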

8.1.4 Environment Map Estimator

As discussed in Section 3.1 of the main paper, the ground-truth environment map is estimated from the image and the ground-truth albedo and normals using a deep network. The detailed architecture of this network is:
C64(k7) - C*128(k3) - C*256(k3) - 4 ResBLKs - C256(k1) - C*256(k3) - C*128(k3) - C*3(k3) - BU(18,36),
where ‘CN(kS)’ denotes a convolution layer with N filters of kernel size S and stride 1, followed by Batch Normalization and ReLU, ‘C*N(kS)’ denotes a convolution layer with N filters of kernel size S and stride 2, followed by Batch Normalization and ReLU, and BU(18,36) bilinearly upsamples the response to produce an 18 × 36 environment map. Each ‘ResBLK’ consists of Conv256(k3) - BN - ReLU - Conv256(k3) - BN, where ‘ConvN(kS)’ denotes a convolution layer with N filters of kernel size S and stride 1 and ‘BN’ denotes Batch Normalization.

8.2 Training Details

8.2.1 Training with weak supervision over albedo

The IIW dataset provides relative reflectance judgments from humans. For any two points i and j on an image, a weighted confidence score classifies the reflectance at i as roughly the same as, brighter than, or darker than the reflectance at j. We use these labels to construct a hinge loss for sparse supervision based on the WHDR metric presented in [2]. Specifically, if users judge point i to be darker than point j with a given confidence, we penalize predictions in which the albedo at i is not sufficiently darker than the albedo at j, weighted by that confidence; if i and j are judged to have similar reflectance, we penalize the weighted difference between the two albedo values. We observed empirically that this loss function performs better than the WHDR metric itself, which is an L0 version of our loss. We train on real data with the following losses: (i) a pseudo-supervision loss over the albedo, normals, and lighting, following [32]; (ii) the photometric reconstruction loss with the RAR; and (iii) the pair-wise weak supervision over the albedo. The total loss is the (weighted) sum of these terms:

L_real = L_pseudo + L_recon + L_IIW.    (8)
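A sketch of this hinge-style weak supervision is given below; the margin value is a placeholder, not the one used in the paper:

```python
import torch

def iiw_hinge_loss(albedo, comparisons, margin=0.12):
    """Weak supervision from IIW pair-wise reflectance judgments (a sketch).

    albedo:      (H, W) predicted reflectance intensity (a torch tensor).
    comparisons: list of ((y1, x1), (y2, x2), label, w) with label in
                 {'darker', 'brighter', 'equal'} for point 1 vs. point 2
                 and w the human confidence weight.
    """
    loss = albedo.new_zeros(())
    for (y1, x1), (y2, x2), label, w in comparisons:
        r1, r2 = albedo[y1, x1], albedo[y2, x2]
        if label == 'equal':
            loss = loss + w * torch.abs(r1 - r2)
        elif label == 'darker':      # point 1 judged darker than point 2
            loss = loss + w * torch.clamp(r1 - r2 + margin, min=0.0)
        else:                        # point 1 judged brighter than point 2
            loss = loss + w * torch.clamp(r2 - r1 + margin, min=0.0)
    return loss / max(len(comparisons), 1)
```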

8.2.2 Training with weak supervision over normals

We also train on the NYUv2 dataset with weak supervision over the normals, obtained from the Kinect depth data of each scene. We train with the following losses: (i) a pseudo-supervision loss over the albedo and lighting, following [32]; (ii) the photometric reconstruction loss with the RAR; and (iii) supervision over the Kinect normals. The total loss is the (weighted) sum of these terms:

L_real = L_pseudo + L_recon + L_normal.    (9)

8.3 Our SUNCG-PBR Dataset

We present more example images from our SUNCG-PBR dataset in Figure 15. We also compare renderings from SUNCG-PBR with those of PBRS [45] under the same illumination conditions in Figures 16 and 17. SUNCG-PBR provides more photo-realistic and less noisy images with specular highlights. Both SUNCG-PBR and PBRS are rendered with Mitsuba [11]. We will release the dataset upon publication.

8.4 More Experimental Results

Comparison with SIRFS.

We present more detailed qualitative evaluations in this section. In Figure 19 we compare the results of our algorithm with those of SIRFS [1]. SIRFS is an optimization-based method for inverse rendering which estimates surface normals, albedo, and spherical harmonics lighting from a single image. Compared to SIRFS, we obtain more accurate normals and better disambiguation of reflectance from shading.

Comparison with CGIntrinsics.

In Figure 18 we compare the albedo predicted by our method with that of CGIntrinsics [19], which performs intrinsic image decomposition. Intrinsic image decomposition methods do not explicitly recover the geometry, illumination, or glossiness of the material, but rather combine them together as shading. In contrast, our goal is to perform complete inverse rendering, which has a wider range of applications in AR/VR.

Evaluation of lighting estimation.

In Figure 20 we present a qualitative evaluation of lighting estimation by inserting a diffuse hemisphere into the scene and rendering it with the light inferred from the image. We compare with our implementation of the method proposed by Gardner et al. [8], which also estimates an environment map from a single indoor image. Recall that our lighting estimation network predicts the environment map given the image, normals, and albedo. ‘GT+’ estimates the environment map given the image and the ground-truth normals and albedo, and thus serves as an achievable upper bound on the quality of the estimated lighting. ‘Ours’ estimates the environment map from an image with IRN. ‘Ours+’ predicts the environment map by feeding the albedo and normals inferred by IRN to the lighting estimation network. Both ‘Ours’ and ‘Ours+’ outperform Gardner et al. [8], as they seem to produce more realistic environment maps. ‘Ours+’ improves lighting estimation over ‘Ours’ by making greater use of the predicted albedo and normals.

Our results and ablation study.

Figure 21 shows examples of our results, with the albedo, glossiness segmentation, normals, and lighting predicted by the network, as well as the image reconstructed with the direct renderer and the proposed Residual Appearance Renderer (RAR). In Figures 22 and 23, we perform a detailed ablation study of the different components of our method. We show that it is important to train on real data, as networks trained only on synthetic data fail to generalize well to real data. We also show that training on real data using the Residual Appearance Renderer (RAR), which captures complex appearance effects, significantly improves performance. Finally, incorporating weak supervision from relative reflectance judgments helps the network predict uniform albedo across large objects.

Figure 15: Our SUNCG-PBR Dataset. We provide 235,893 images of a scene assuming specular and diffuse reflectance along with ground truth depth, surface normals, albedo, Phong model parameters, semantic segmentation and glossiness segmentation.
Figure 16: Comparison with PBRS [45]. Our dataset provides more photo-realistic and less noisy images with specular highlights under multiple lighting conditions.
Figure 17: Comparison with PBRS [45]. Our dataset provides more photo-realistic and less noisy images with specular highlights under multiple lighting conditions.
Figure 18: Comparison with CGI (Li et. al. [19]). In comparison with CGI [19], our method performs better disambiguation of reflectance from shading and preserves the texture in the albedo.
Figure 19: Comparison with SIRFS [1]. Using deep CNNs our method performs better disambiguation of reflectance from shading and predicts better surface normals.
Figure 20: Evaluation of lighting estimation. We compare with our implementation of Gardner et al[8]. ‘GT+’ predicts lighting conditioned on the ground-truth normals and albedo. ‘Ours+’ predicts the environment map by conditioning it on the albedo and normals inferred by IRN.
Figure 21: Our Result. We show the estimated intrinsic components; normals, albedo, glossiness segmentation (matte-blue, glossy-red and semi-glossy-green) and lighting predicted by the network, along with the reconstructed image with our direct renderer and the RAR.
Figure 22: Ablation Study. For each input image (column 1), we show the predicted albedo in columns 2–5. Column 2 shows the albedo predicted by IRN trained on SUNCG-PBR only. Columns 3 and 4 show the albedo predicted by IRN fine-tuned on real data without and with RAR, respectively. Column 5 shows the albedo predicted by IRN trained on real data with RAR and weak supervision.
Figure 23: Ablation Study. For each input image (column 1), we show the predicted albedo in columns 2–5. Column 2 shows the albedo predicted by IRN trained on SUNCG-PBR only. Columns 3 and 4 show the albedo predicted by IRN fine-tuned on real data without and with RAR, respectively. Column 5 shows the albedo predicted by IRN trained on real data with RAR and weak supervision.