SPSG: Self-Supervised Photometric Scene Generation from RGB-D Scans

06/25/2020 ∙ by Angela Dai, et al.

We present SPSG, a novel approach to generate high-quality, colored 3D models of scenes from RGB-D scan observations by learning to infer unobserved scene geometry and color in a self-supervised fashion. Our self-supervised approach learns to jointly inpaint geometry and color by correlating an incomplete RGB-D scan with a more complete version of that scan. Notably, rather than relying on 3D reconstruction losses to inform our 3D geometry and color reconstruction, we propose adversarial and perceptual losses operating on 2D renderings in order to achieve high-resolution, high-quality colored reconstructions of scenes. This exploits the high-resolution, self-consistent signal from individual raw RGB-D frames, in contrast to fused 3D reconstructions of the frames which exhibit inconsistencies from view-dependent effects, such as color balancing or pose inconsistencies. Thus, by informing our 3D scene generation directly through 2D signal, we produce high-quality colored reconstructions of 3D scenes, outperforming state of the art on both synthetic and real data.




1 Introduction

Figure 1: Our SPSG approach formulates the problem of generating a complete, colored 3D model from an incomplete scan observation to be self-supervised, enabling training on incomplete real-world scan data. Our key idea is to leverage a 2D view-guided synthesis for self-supervision, comparing rendered views of our predicted model to the original RGB-D frames of the scan.

The wide availability of consumer range cameras has propelled research in 3D reconstruction of real-world environments, with applications ranging from content creation to indoor robotic navigation and autonomous driving. While state-of-the-art 3D reconstruction approaches have now demonstrated robust camera tracking and large-scale reconstruction Newcombe et al. (2011); Izadi et al. (2011); Whelan et al. (2015); Dai et al. (2017a), occlusions and sensor limitations lead these approaches to yield reconstructions that are incomplete both in geometry and in color, making them ill-suited for use in the aforementioned applications.

In recent years, geometric deep learning has made significant progress in learning to reconstruct complete, high-fidelity 3D models of shapes from RGB or RGB-D observations Maturana and Scherer (2015); Dai et al. (2017b); Riegler et al. (2017); Mescheder et al. (2019); Park et al. (2019), leveraging synthetic 3D shape data to provide supervision for the geometric completion task. Recent work has also advanced generative 3D approaches towards operating on larger-scale scenes Song et al. (2017); Dai et al. (2018); Dai et al. (2020). However, producing complete, colored 3D reconstructions of real-world environments remains challenging; in particular, for real-world observations, we do not have complete ground truth data available. Several promising approaches have been proposed to produce geometric and color reconstructions of 3D shapes, but they tend to rely on single-object domain specificity Saito et al. (2019) or synthetic 3D data for supervision Sun et al. (2018), rendering them unsuitable for reconstructing colored 3D models of real-world scenes due to the significantly larger contextual scale and domain gap with synthetic data.

We introduce SPSG, a generative 3D approach to create high-quality 3D models of real-world scenes from partial RGB-D scan observations in a self-supervised fashion. Our self-supervised approach uses an incomplete RGB-D scan as the target and creates a more incomplete version of it as the input by removing frames. This allows correlation of more-incomplete to less-incomplete scans while ignoring unobserved regions. However, the target scan reconstruction from the given RGB-D scan suffers from inconsistencies in camera alignments and view-dependent effects, resulting in significant color artifacts. Moreover, the success of adversarial approaches in 2D image generation Goodfellow et al. (2014); Karras et al. (2017) cannot be directly transferred when the target scan is incomplete, as this results in the ‘real’ examples for the discriminator taking on incomplete characteristics. Our key observation is that while a 3D scan is incomplete, each individual 2D frame is complete from its viewpoint. Thus, we leverage the 2D signal provided by the raw RGB-D frames, which provide high-resolution, self-consistent observations as well as photo-realistic examples for adversarial and perceptual losses in 2D.

Thus, our generative 3D model predicts a 3D scene reconstruction represented as a truncated signed distance function with per-voxel colors (TSDF), where we leverage a differentiable renderer to compare the predicted geometry and color to the original RGB-D frames. In addition, we employ a 2D adversarial and 2D perceptual loss between the rendering and the original input in order to achieve sharp, high-quality, complete colored 3D reconstructions.

Our experiments show that our 2D-based self-supervised approach towards inferring complete geometric and colored 3D reconstructions produces significantly improved performance in comparison to state-of-the-art methods, both quantitatively and qualitatively, on both synthetic and real data. We additionally analyze the effect of the 2D rendering losses in contrast to using 3D reconstruction, adversarial, and perceptual losses, and demonstrate that our 2D loss formulation avoids various artifacts introduced by a 3D loss formulation. This enables our self-supervised approach to generate compelling colored 3D models for real-world scans of large-scale scenes.

2 Related Work

RGB-D based 3D Reconstruction

3D reconstruction of objects and scenes using RGB-D data is a well-explored field Newcombe et al. (2011); Izadi et al. (2011); Whelan et al. (2015); Dai et al. (2017a). For a detailed overview of 3D reconstruction methods, we refer to the state-of-the-art report of Zollhöfer et al. (2018). In addition, our work is related to surface texturing techniques which optimize for texture in observed regions Huang et al. (2017, 2020); in contrast, our goal is to target incomplete scans where color data is missing in the 3D scans.

Learned Single Object Reconstruction

The reconstruction of single objects given RGB or RGB-D input is an active field of research. Many works have explored a variety of geometric shape representations, including occupancy grids Wu et al. (2015), volumetric truncated signed distance fields Dai et al. (2017b), point clouds Yang et al. (2019), and recently using deep networks to model implicit surface representations Park et al. (2019); Mescheder et al. (2019); Xu et al. (2019).

While such methods have shown impressive geometric reconstruction, generating colored objects has been far less explored. Im2Avatar Sun et al. (2018) predicts an occupancy grid to represent the shape, followed by predicting a color volume. PIFu Saito et al. (2019) proposes to estimate a pixel-aligned implicit function representing both the shape and appearance of an object, focusing on the reconstruction of humans. While Texture Fields Oechsle et al. (2019) does not reconstruct 3D geometry, this approach predicts the color for a shape by estimating a function mapping a surface position to a color value. These approaches make significant progress in estimating colored reconstructions, but focus on the limited domain of objects, which are both limited in volume and far more structured than full scenes.

Learned Scene Completion

While there is a large corpus of work on single object reconstruction, there have been fewer efforts focusing on reconstructing scenes. SSCNet Song et al. (2017) introduces a method to jointly predict the geometric occupancy and semantic segmentation of a scene from an RGB-D image. ScanComplete Dai et al. (2018) introduces an autoregressive approach to complete partial scans of large-scale scenes. These approaches focus on geometric and semantic predictions, relying on synthetic 3D data to provide complete ground truth scenes for training, resulting in loss of quality due to the synthetic-real domain gap when applied to real-world scans. In contrast, SG-NN Dai et al. (2020) proposes a self-supervised approach for geometric completion of partial scans, allowing training on real data. Our approach is inspired by that of SG-NN; however, we find that their 3D self-supervision formulation is insufficient for compelling color generation, and instead propose to guide our self-supervision through 2D renderings of our 3D predictions.

3 Method Overview

Our aim is to generate a complete 3D model, with respect to both geometry and color, from an incomplete RGB-D scan. We take as input a series of RGB-D frames and estimated camera poses, fused into a truncated signed distance field representation (TSDF) through volumetric fusion Curless and Levoy (1996). The input TSDF is represented in a volumetric grid, with each voxel storing both distance and color values. We then learn to generate a TSDF representing the complete geometry and color, from which we extract the final mesh using Marching Cubes Lorensen and Cline (1987).
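As a rough illustration of the volumetric fusion step, the sketch below applies the weighted running-average update of Curless and Levoy (1996) to per-voxel distance and color values. The function name `integrate_observation`, the unit weight per valid observation, and the assumption that each frame has already been projected into per-voxel distance and color observations are illustrative choices, not details taken from the paper.

```python
import numpy as np

def integrate_observation(tsdf, color, weight, obs_sdf, obs_color, obs_valid, truncation):
    """Fuse one frame's per-voxel observations into running TSDF and color averages.

    tsdf, weight, obs_sdf, obs_valid: arrays of shape (X, Y, Z)
    color, obs_color:                 arrays of shape (X, Y, Z, 3)
    """
    # Truncate the observed signed distances to the band around the surface.
    d = np.clip(obs_sdf, -truncation, truncation)
    # Assumption: every valid observation contributes with unit weight.
    w_new = obs_valid.astype(np.float64)
    w_sum = weight + w_new
    seen = w_sum > 0
    safe = np.where(seen, w_sum, 1.0)  # avoid division by zero in unseen voxels

    # Weighted running average; voxels never observed keep their old values.
    tsdf_out = np.where(seen, (weight * tsdf + w_new * d) / safe, tsdf)
    color_out = np.where(
        seen[..., None],
        (weight[..., None] * color + w_new[..., None] * obs_color) / safe[..., None],
        color,
    )
    return tsdf_out, color_out, w_sum
```

Calling the function once per frame accumulates all frames into a single grid; voxels whose weight stays at zero correspond to unobserved space.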

To effectively generate compelling color and geometry for real scan data, we develop a self-supervised approach to learn from incomplete target scans. From an incomplete target scan, we generate a more incomplete version by removing a subset of its RGB-D frames, and learn the generation process between the two levels of incompleteness while ignoring the unobserved space in the target scan. Notably, we do not rely on the incomplete target 3D colored TSDF, which contains inconsistencies from view-dependent effects and micro-misalignments in camera pose estimation, and is often lower resolution than the color sensor (to account for the lower resolution and noise in the depth capture). Instead, we propose a 2D view-guided synthesis, relying on losses formulated on 2D renderings of our predicted TSDF. As each individual image is self-consistent and high resolution, we mitigate such artifacts by leveraging this image information to guide our predictions.

That is, we render our predicted TSDF to the views of the original images, with which we can then compare our rendered predictions and the original RGB-D frames. This allows us to exploit the consistency of each individual frame during training, as well as employ not only a reconstruction loss for geometry and color, but also adversarial and perceptual losses, where the ‘real’ target images are the raw RGB-D frames. Each of these views is complete, high-resolution, and photo-realistic, which provides guidance for our approach to learn to generate complete, high-quality, colored 3D models.

4 Self-supervised Photometric Generation

The key idea of our method for photometric scene generation from incomplete RGB-D scan observations is to formulate a self-supervised approach based on 2D view-guided synthesis, leveraging rendered views of our predicted 3D model. Since training on real-world scan data is crucial for realistic color generation, we need to be able to learn from incomplete target scan data as complete ground truth is unavailable for real-world scans.

Thus, we learn a generative process from the correlation of an incomplete target scan $S_t$ composed of RGB-D frames $\{f_k\}$ with a more incomplete version of that scan, $S_i$, constructed from a subset of the frames $\{f_i\} \subset \{f_k\}$. The input scan $S_i$ during training is then created by volumetric fusion of $\{f_i\}$ to a volumetric TSDF with per-voxel distances and colors. This is inspired by the SG-NN approach Dai et al. (2020); however, crucially, rather than relying on the fused incomplete target TSDF, we formulate 2D-based rendering losses to guide our geometry and color predictions. This avoids smaller-scale artifacts from inconsistencies in camera pose estimation as well as from view-dependent lighting and color balancing, and importantly, allows formulation of adversarial and perceptual losses with the raw RGB-D frames, which are individually complete views in image space. These losses are critical towards producing compelling photometric scene generation results.

Additionally, our self-supervision exploits the different patterns of incompleteness seen across a variety of target scans, where each individual target scan remains incomplete but learning across a diverse set of patterns enables generating output 3D models that have more complete, consistent geometry and color than any single target scan seen during training.

4.1 Differentiable Rendering

To formulate our 2D-based losses, we render our predicted TSDF $S_p$ in a differentiable fashion to generate color, depth, and world-space normal images, $C_v$, $D_v$, and $N_v$, for a given view $v$. We then operate on $C_v$, $D_v$, and $N_v$ to formulate our reconstruction, adversarial, and perceptual losses.

Specifically, for the predicted TSDF $S_p$ comprising per-voxel distances and colors, and a camera view $v$ with the intrinsics (focal length, principal point), extrinsics (rotation, translation), and image dimensions, we generate the color, depth, and normal images $C_v$, $D_v$, and $N_v$ by raycasting, as shown in Figure 2. For each pixel in the output image, we construct a ray $r$ from the view $v$ and march along $r$ through $S_p$, using trilinear interpolation to determine TSDF values. To locate the surface at the zero-crossing of $S_p$, we look for sign changes between current and previous TSDF values.

Figure 2: Differentiable rendering of our 3D predicted TSDF geometry and color.

For efficient search, we first use a fixed increment to search along the ray (half of the truncation value), and once a zero-crossing has been detected, we use an iterative line search to refine the estimate. The refined zero-crossing location is then used to provide the depth, normal, and color values for the rendered images $D_v$, $N_v$, and $C_v$ as distance from the camera, negative gradient of the TSDF, and associated color value, respectively.
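The search along a single ray can be sketched as follows. This is a minimal CPU illustration, not the CUDA renderer described in the paper: the names `trilinear` and `find_surface` are hypothetical, coordinates are in voxel units, and bisection is used as one possible choice for the iterative line search.

```python
import numpy as np

def trilinear(grid, p):
    """Trilinearly interpolate a 3D grid at continuous voxel coordinates p = (x, y, z)."""
    p = np.clip(p, 0.0, np.array(grid.shape) - 1.000001)
    i0 = np.floor(p).astype(int)
    f = p - i0
    value = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((f[0] if dx else 1.0 - f[0]) *
                     (f[1] if dy else 1.0 - f[1]) *
                     (f[2] if dz else 1.0 - f[2]))
                value += w * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return value

def find_surface(grid, origin, direction, step, max_dist, refine_iters=8):
    """Return the distance along the ray to the first TSDF zero-crossing, or None."""
    direction = direction / np.linalg.norm(direction)
    t_prev, v_prev = 0.0, trilinear(grid, origin)
    t = step
    while t <= max_dist:
        v = trilinear(grid, origin + t * direction)
        # Coarse search: a sign change between the previous and current sample.
        if (v_prev > 0) != (v > 0):
            lo, hi, v_lo = t_prev, t, v_prev
            # Refinement: bisection, used here as the iterative line search (assumption).
            for _ in range(refine_iters):
                mid = 0.5 * (lo + hi)
                v_mid = trilinear(grid, origin + mid * direction)
                if (v_mid > 0) == (v_lo > 0):
                    lo = mid
                else:
                    hi = mid
            return 0.5 * (lo + hi)
        t_prev, v_prev = t, v
        t += step
    return None
```

The returned distance is the depth value for that pixel; the normal and color at the hit point could be obtained in the same way by interpolating the TSDF gradient and the color volume at that location.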

Our differentiable TSDF rendering is implemented in CUDA as a PyTorch extension for efficient runtime, with the backward pass similarly implemented through ray marching, using atomic add operations to accumulate gradient information when multiple pixels correspond to a voxel.
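The need for atomic accumulation in the backward pass can be illustrated on the CPU with NumPy's unbuffered scatter-add. This is only an analogy to the CUDA atomic adds mentioned above; the function name and the one-voxel-per-pixel simplification are assumptions made for the example.

```python
import numpy as np

def accumulate_voxel_gradients(grid_shape, voxel_indices, pixel_grads):
    """Sum per-pixel gradients into the voxels they touch.

    voxel_indices: (N, 3) integer array, one voxel per contributing pixel
    pixel_grads:   (N,) array of gradient values flowing back from the image
    """
    grad = np.zeros(grid_shape, dtype=np.float64)
    idx = tuple(voxel_indices.T)
    # np.add.at accumulates repeated indices, like an atomic add would;
    # plain `grad[idx] += pixel_grads` would keep only one contribution per voxel.
    np.add.at(grad, idx, pixel_grads)
    return grad
```

When two pixels hit the same voxel, both contributions are summed rather than one overwriting the other.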

4.2 2D View-Guided Synthesis / Re-rendering loss

Our self-supervised approach is based on 2D losses operating on the depth, normal, and color images $D_v$, $N_v$, and $C_v$, which are rendered from the predicted TSDF $S_p$. This enables comparison to the original RGB-D frame data $D_v^t$, $N_v^t$ (normals are computed in world space from the depth images), and $C_v^t$, thus avoiding explicit view inconsistencies in the targets as well as providing complete target view information. For the task of generating a complete photometric reconstruction from an incomplete scan, we employ a reconstruction loss to anchor geometry and color predictions, as well as an adversarial and a perceptual loss to capture more realistic appearance in the final prediction.

Reconstruction Loss.

We use an $\ell_1$ loss to guide the rendered depth $D_v$ and color $C_v$ to the target depth $D_v^t$ and color $C_v^t$:

$$L_R^D = \frac{1}{|P|}\sum_{p \in P} \left\| D_v(p) - D_v^t(p) \right\|_1, \qquad L_R^C = \frac{1}{|P|}\sum_{p \in P} \left\| C_v(p) - C_v^t(p) \right\|_1$$

Since the rendered $D_v$ and $C_v$ may not have valid values for all pixels (where no surface geometry was seen), these losses operate only on the valid pixels $P$, normalized by the number of valid pixels $|P|$. The color loss operates on the 3 channels of the CIELAB color space, which we empirically found to provide better color performance than RGB space. Note that these reconstruction losses as formulated have a trivial solution where generating no surface geometry in the predicted TSDF $S_p$ provides no loss, so we employ a 3D geometric reconstruction loss $L_R^G$ on the predicted 3D TSDF distances, weighted by a small value $w_g$, to discourage lack of surface geometry prediction. For $L_R^G$, we mask out any voxels which were unobserved in the target scan. The final reconstruction loss is then $L_R = w_g L_R^G + L_R^D + L_R^C$.
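A minimal NumPy sketch of the masked 2D reconstruction terms is given below. It assumes an $\ell_1$ penalty summed over channels per pixel, sRGB inputs in [0, 1], and a D65 white point for the CIELAB conversion; the function names are hypothetical and the conversion is the standard textbook formula rather than anything specified in the paper.

```python
import numpy as np

_RGB_TO_XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                        [0.2126729, 0.7151522, 0.0721750],
                        [0.0193339, 0.1191920, 0.9503041]])
_D65_WHITE = np.array([0.95047, 1.0, 1.08883])

def srgb_to_lab(rgb):
    """Convert sRGB values in [0, 1], shape (..., 3), to CIELAB (D65 white point)."""
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    xyz = linear @ _RGB_TO_XYZ.T / _D65_WHITE
    delta = 6.0 / 29.0
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3.0 * delta ** 2) + 4.0 / 29.0)
    L = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)

def masked_l1(pred, target, valid):
    """Mean absolute error over valid pixels only; returns 0.0 if none are valid."""
    n = int(valid.sum())
    if n == 0:
        return 0.0  # no rendered surface: the 2D terms alone give no penalty
    diff = np.abs(pred - target)
    if diff.ndim == 3:
        diff = diff.sum(axis=-1)  # sum over channels per pixel (assumption)
    return float(diff[valid].sum() / n)

def reconstruction_2d(depth, depth_t, rgb, rgb_t, valid):
    """Depth term plus color term, with color compared in CIELAB space."""
    return masked_l1(depth, depth_t, valid) + masked_l1(srgb_to_lab(rgb), srgb_to_lab(rgb_t), valid)
```

The zero return value for an empty mask makes the trivial solution explicit: without an additional 3D geometric term, predicting no surface at all would incur no 2D loss.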

Adversarial Loss.

To capture a more realistic photometric scene generation, we employ an adversarial loss on both the rendered normal image $N_v$ and the rendered color image $C_v$. Note that since depth values are completely view dependent, we do not use this information in the adversarial loss. In particular, the adversarial loss helps avoid the averaging artifacts that arise when only the reconstruction loss is used, which helps markedly in addressing color imbalance in the training set (e.g., color dominated by wall and floor colors, which typically have little diversity). We use the conditional adversarial loss:


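$$L_A = \mathbb{E}_{x, N_v^t, C_v^t}\left[\log \mathcal{D}\left([x, N_v^t, C_v^t]\right)\right] + \mathbb{E}_{x, N_v, C_v}\left[\log\left(1 - \mathcal{D}\left([x, N_v, C_v]\right)\right)\right]$$

where $\mathcal{D}$ is the discriminator, $[\cdot, \cdot]$ denotes concatenation, and $x$ is the condition, with $x = [N_v^i, C_v^i]$, where $N_v^i$ and $C_v^i$ are the rendered normal and color images of the input scan from view $v$. Note that although the target images $N_v^t$ and $C_v^t$ can be considered complete in the image view, the rendered images $N_v$ and $C_v$ may contain invalid pixels; for these invalid pixels we copy the corresponding values from $N_v^t$ and $C_v^t$ to avoid trivially recognizing real from synthesized by number of invalid pixels.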

Similar to Pix2Pix Isola et al. (2017), we use a patch-based discriminator, on patches of images.
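The pieces of the conditional adversarial setup can be sketched independently of any particular network. In the snippet below, the helper names are hypothetical, the discriminator itself is left out and represented only by its per-patch logits, and the non-saturating generator loss is a common practical substitute that is assumed here rather than taken from the paper.

```python
import numpy as np

def fill_invalid(rendered, target, valid):
    """Copy target values into pixels where the rendering produced no surface."""
    return np.where(valid[..., None], rendered, target)

def discriminator_input(condition, normals, color):
    """Concatenate condition, normal, and color images along the channel axis."""
    return np.concatenate([condition, normals, color], axis=-1)

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def gan_losses(logits_real, logits_fake):
    """Discriminator and generator losses from per-patch logits.

    -log(sigmoid(x)) equals softplus(-x); -log(1 - sigmoid(x)) equals softplus(x).
    The generator term uses the non-saturating form (assumption).
    """
    d_loss = softplus(-logits_real).mean() + softplus(logits_fake).mean()
    g_loss = softplus(-logits_fake).mean()
    return float(d_loss), float(g_loss)
```

With a patch-based discriminator, `logits_real` and `logits_fake` hold one value per image patch, so the means average the decision over all patches.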

Perceptual Loss.

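We additionally employ a loss to penalize perceptual differences from the rendered color images of our predicted TSDF. We use a pretrained VGG network Simonyan and Zisserman (2014), and use a content loss Gatys et al. (2016) where feature maps from the eighth convolutional layer are compared with an $\ell_2$ loss:

$$L_P = \left\| \mathrm{VGG}_8(C_v) - \mathrm{VGG}_8(C_v^t) \right\|_2$$

where $C_v$ is the rendered color image and $C_v^t$ is the target color image for view $v$.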


4.3 Data Generation

To generate the input and target scans $S_i$ and $S_t$ used during training, we use a random subset of the target RGB-D frames (in our experiments, ) to construct $S_i$. Both $S_i$ and $S_t$ are then constructed through volumetric fusion Curless and Levoy (1996); we use a voxel resolution of cm. In order to realize efficient training, we train on cropped chunks of the input-target pairs of size voxels. For each training chunk, we associate up to five RGB-D frames based on their geometric overlap with the chunk. These frames are used as targets for the 2D losses on the rendered predictions.
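The two selection steps in this paragraph can be sketched as below. The function names, the `keep_fraction` parameter, and the precomputed per-frame `overlap_scores` are hypothetical; how geometric overlap is measured is not specified here and is assumed to be available as one score per frame.

```python
import numpy as np

def sample_input_frames(num_frames, keep_fraction, rng):
    """Pick a random subset of frame indices to fuse into the more incomplete input scan."""
    num_keep = max(1, int(round(keep_fraction * num_frames)))
    return np.sort(rng.choice(num_frames, size=num_keep, replace=False))

def frames_for_chunk(overlap_scores, max_frames=5):
    """Return indices of up to max_frames frames with the largest nonzero overlap."""
    order = np.argsort(-overlap_scores, kind="stable")
    return [int(i) for i in order[:max_frames] if overlap_scores[i] > 0]
```

For example, `sample_input_frames(10, 0.5, np.random.default_rng(0))` returns five distinct frame indices; the value 0.5 is only a placeholder for the example.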

4.4 Network Architecture

Figure 3: Network architecture overview. Our approach is fully-convolutional, operating on an input TSDF volume and predicting an output TSDF, from which we apply our 2D view-guided synthesis.

Our network, visualized in Figure 3, is designed to produce a 3D volumetric TSDF representation of a scene from an input volumetric TSDF. We predict both geometry and color in a fully-convolutional, end-to-end-trainable fashion. We first predict geometry, followed by color, so that the color predictions can be directly informed by the geometric structure. The geometry is predicted with an encoder-decoder structure, then color using an encoder-decoder followed by a series of convolutions which maintain spatial resolution.

The encoder-decoder for geometry prediction spatially subsamples to a factor of the original resolution, and outputs a feature map from which the final geometry is predicted. The geometric predictions then inform the color prediction, with input to the next encoder-decoder. The color prediction is structured similarly to the geometry encoder-decoder, with a series of additional convolutions maintaining the spatial resolution. We found that avoiding spatial subsampling before the color prediction helped to avoid checkering artifacts in the predicted color outputs.

Our discriminator architecture is composed of a series of 2D convolutions, each spatially subsampling its input by a factor of 2. For a detailed architecture specification, we refer to the appendix.

Training Details

We train our approach on a single NVIDIA GeForce RTX 2080. We weight the loss term with and the adversarial loss for the generator by ; all other terms in the loss have a weight of . We use the Adam optimizer with a learning rate of and batch size of , and train our model for hours until convergence. For efficient training, we train on cropped chunks of scans; at test time, since our model is fully-convolutional, we operate on entire incomplete scans of varying sizes as input.

5 Results

| Method | SSIM (↑) | Feature-$\ell_1$ (↓) | FID (↓) |
|---|---|---|---|
| PIFu Saito et al. (2019) | 0.67 | 0.25 | 81.5 |
| Texture Fields Oechsle et al. (2019) (on Ours Geometry) | 0.70 | 0.23 | 68.4 |
| Ours | 0.71 | 0.22 | 56.0 |

Table 1: Evaluation of colored reconstruction from incomplete scans of Matterport3D Chang et al. (2017) scenes. We evaluate rendered views of the outputs of all methods against the original color images.

| Method | SSIM (↑) | Feature-$\ell_1$ (↓) | FID (↓) |
|---|---|---|---|
| Baseline-3D | 0.694 | 0.236 | 80.51 |
| Ours ($\ell_1$ only) | 0.699 | 0.231 | 67.92 |
| Ours (no adversarial) | 0.695 | 0.229 | 62.15 |
| Ours (no perceptual) | 0.699 | 0.227 | 61.46 |
| Ours | 0.709 | 0.219 | 56.03 |

Table 2: Ablation study of our design choices on Matterport3D Chang et al. (2017) scans.

To evaluate our SPSG approach, we consider the real-world scans from the Matterport3D dataset Chang et al. (2017), where no complete ground truth is available for color and geometry, and additionally provide further analysis on synthetic data from the chair class of ShapeNet Chang et al. (2015), where complete ground truth data is available. To enable quantitative evaluation on Matterport3D scenes, we consider input scans generated with of all available RGB-D frames for each scene, and evaluate against the target scan composed of all available RGB-D frames (ignoring unobserved space). For ShapeNet, we consider single RGB-D frame input, and the complete shape as the target.

Evaluation metrics

To evaluate our color reconstruction quality, we adopt several metrics to evaluate rendered views of the predicted meshes in comparison to the original views (as we do not have complete 3D color data available for real-world scenarios). First, we consider the Fréchet Inception Distance (FID) Heusel et al. (2017), which is commonly used to evaluate the quality of images synthesized by 2D generative techniques, and captures a distance between the distributions of synthesized images and real images. The structural similarity index (SSIM) Brunet et al. (2011) is often used to measure more local characteristics in comparing a synthesized image directly to the target image, but can tend to favor averaging over sharp detail. Finally, we capture a perceptual metric, Feature-$\ell_1$, following the metric proposed in Oechsle et al. (2019), which evaluates the $\ell_1$ distance between the feature embeddings of the synthesized and target images under an InceptionV3 network Szegedy et al. (2016).
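Given feature embeddings that have already been extracted, the two embedding-based metrics can be sketched as follows. The Fréchet distance uses the standard closed form between two Gaussians fitted to the embeddings; the function names are hypothetical, and averaging the absolute difference over all feature dimensions is an assumption about how the Feature-$\ell_1$ distance is normalized.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two sets of embeddings, shape (N, D)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(np.sum((mu_a - mu_b) ** 2) + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

def feature_l1(feats_pred, feats_target):
    """Mean absolute difference between paired embeddings (normalization is an assumption)."""
    return float(np.abs(feats_pred - feats_target).mean())
```

In practice the embeddings would come from an InceptionV3 network applied to the rendered and original views; that extraction step is omitted here.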

To measure the geometric quality of our reconstructed shapes and scenes, we use an intersection-over-union (IoU) metric as well as a Chamfer distance metric. IoU is computed over the voxelization of the output meshes of all approaches, with voxel size of cm for Matterport3D data and (relative to the unit normalized space) for ShapeNet data. For Chamfer distance, we sample 30K points from the output meshes as well as ground truth meshes, and compute the distance in metric space for Matterport3D and normalized space for ShapeNet. Note that for the case of real scans, all unobserved space in the target is ignored for the geometric evaluation.
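Both geometric metrics can be sketched in a few lines. Chamfer distance has several common variants; the version below sums the two mean nearest-neighbor distances without squaring, which is an assumption rather than the paper's stated definition, and the `observed` mask is a hypothetical way of ignoring unobserved target space.

```python
import numpy as np

def voxel_iou(occ_pred, occ_target, observed=None):
    """Intersection-over-union of two boolean occupancy grids, restricted to observed voxels."""
    if observed is None:
        observed = np.ones_like(occ_target, dtype=bool)
    inter = np.logical_and(occ_pred, occ_target) & observed
    union = np.logical_or(occ_pred, occ_target) & observed
    total = union.sum()
    return float(inter.sum() / total) if total > 0 else 1.0

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3)."""
    # Brute-force pairwise distances; fine for small sets, but for tens of
    # thousands of points a spatial index (e.g. a k-d tree) would be needed.
    d = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```

The brute-force distance matrix grows with the product of the two set sizes, so this form is meant only to make the definition concrete.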

For all comparisons to state-of-the-art approaches predicting both color and geometry, we provide as input the incomplete TSDF and color, and if necessary, adapt the method’s input (denoted by ).

| Method | SSIM (↑) | Feature-$\ell_1$ (↓) | FID (↓) |
|---|---|---|---|
| Im2Avatar Sun et al. (2018) | 0.85 | 0.25 | 59.7 |
| PIFu Saito et al. (2019) | 0.86 | 0.24 | 70.3 |
| Texture Fields Oechsle et al. (2019) (on Ours Geometry) | 0.93 | 0.20 | 30.3 |
| Ours | 0.93 | 0.19 | 29.0 |

Table 3: Evaluation of colored reconstruction from incomplete scans of ShapeNet Chang et al. (2015) chairs.

Figure 4: Qualitative evaluation of colored reconstruction on Matterport3D Chang et al. (2017) scans.

Self-supervised photometric scene generation.

We demonstrate our self-supervised approach to generate reconstructions of scenes from incomplete scan data, using scan data from Matterport3D Chang et al. (2017) with the official train/test split (72/18 trainval/test scenes comprising 1788/394 rooms). Tables 1 and 4 show a comparison of our approach to state-of-the-art methods for color and geometry reconstruction: PIFu Saito et al. (2019) and Texture Fields Oechsle et al. (2019). Since Texture Fields predicts only color, we provide our predicted geometry as input; for test scenes, since it is designed for fixed volume sizes, we apply it in sliding window fashion. We additionally show qualitative results in Figure 4. All methods were trained on the generated input-target pairs of scans from Matterport3D, with frames removed from the target scan to create the corresponding inputs, and with their respective proposed loss functions used for training. Note that the prior methods have all been developed for the single object scenario with full supervision available (e.g., using synthetic ground truth), and are limited in capturing the diversity in geometry and color of real-world scenes. Our self-supervised formulation with rendering losses enables capturing a more realistic distribution of geometry and color in generating complete 3D scenes.

Figure 5: Qualitative evaluation of colored reconstruction on ShapeNet Chang et al. (2015) chairs.

What is the effect of the 2D view-guided synthesis?

In Table 2, we analyze the effects of our various 2D rendering based losses, and show qualitative results in Figure 6. We first replace our rendering-based losses with analogous 3D reconstruction, adversarial, and perceptual losses, and use the 3D incomplete target TSDF instead of 2D views (Baseline-3D). This approach learns to reflect the inconsistencies present in the fused 3D target scan (e.g., striping artifacts where one frame ends and another begins), and moreover, suffers from the incompleteness of the target scan data when used as ‘real’ examples for the discriminator and the perceptual loss (resulting in black artifacts in some missing regions). Thus, our approach to leverage rendering based losses using the original RGB-D frames produces more consistent, compelling reconstructions.

Additionally, we evaluate the effect of our adversarial and perceptual losses on the output color quality, evaluating our approach with the adversarial loss removed (Ours (no adversarial)), perceptual loss removed (Ours (no perceptual)), and both adversarial and perceptual losses removed (Ours ($\ell_1$ only)). Using only an $\ell_1$ loss results in blurry, washed out colors. With the adversarial loss, the colors are less washed out, and with the perceptual loss, colors become sharper; using all losses combines these advantages to achieve compelling scene generation.

Matterport3D scans:

| Method | IoU (↑) | Chamfer Dist. (↓) |
|---|---|---|
| OccNet Mescheder et al. (2019) | 0.05 | 8.05 |
| PIFu Saito et al. (2019) | 0.06 | 2.04 |
| Baseline-3D | 0.33 | 1.99 |
| Ours | 0.35 | 0.69 |

ShapeNet chairs:

| Method | IoU (↑) | Chamfer Dist. (↓) |
|---|---|---|
| Im2Avatar Sun et al. (2018) | 0.17 | 0.27 |
| PIFu Saito et al. (2019) | 0.34 | 0.27 |
| OccNet Mescheder et al. (2019) | 0.46 | 0.20 |
| Ours | 0.66 | 0.09 |

Table 4: Evaluation of geometric reconstruction from Matterport3D Chang et al. (2017) scans (first table) and ShapeNet Chang et al. (2015) chairs (second table). Note that for real scans, unobserved regions in the target are ignored for evaluation.

Evaluation on synthetic 3D shapes.

We additionally evaluate our approach in comparison to state-of-the-art methods on synthetic 3D data, using the chairs category of ShapeNet (5563/619 trainval/test shapes). All methods are provided a single RGB-D frame as input, and for training, the complete shape as target. Tables 4 and 3 show quantitative evaluation for geometry and color predictions, respectively. Our approach predicts more accurate geometry, and our adversarial and perceptual losses provide more compelling color generation.

Figure 6: Qualitative evaluation of our design choices on Matterport3D Chang et al. (2017) scans.

6 Conclusion

We introduce SPSG, a self-supervised approach to generate complete, colored 3D models from incomplete RGB-D scan data. Our 2D view-guided formulation enables self-supervision as well as compelling color generation through 2D adversarial and perceptual losses. Thus we can train and test on real-world scan data where complete ground truth is unavailable, avoiding the large domain gap in using synthetic color and geometry data. We believe this is an exciting avenue for future research, and provides an interesting alternative for synthetic data generation or domain transfer.

This work was supported by the ZD.B (Zentrum Digitalisierung.Bayern), a Google Research Grant, a TUM-IAS Rudolf Mößbauer Fellowship, an NVidia Professorship Award, the ERC Starting Grant Scan2CAD (804724), and the German Research Foundation (DFG) Grant Making Machine Learning on Static and Dynamic 3D Data Practical.


  • D. Brunet, E. R. Vrscay, and Z. Wang (2011) On the mathematical properties of the structural similarity index. IEEE Transactions on Image Processing 21 (4), pp. 1488–1499. Cited by: §5.
  • A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. (2015) Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012. Cited by: Figure 10, §B.3, Figure 5, Table 3, Table 4, §5.
  • A. X. Chang, A. Dai, T. A. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3D: learning from RGB-D data in indoor environments. In 2017 International Conference on 3D Vision, 3DV 2017, Qingdao, China, October 10-12, 2017, pp. 667–676. Cited by: Figure 8, Figure 9, §B.3, Table 5, Figure 4, Figure 6, §5, Table 1, Table 2, Table 4, §5.
  • B. Curless and M. Levoy (1996) A volumetric method for building complex models from range images. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pp. 303–312. Cited by: §3, §4.3.
  • A. Dai, C. Diller, and M. Nießner (2020) SG-NN: sparse generative neural networks for self-supervised scene completion of RGB-D scans. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE. Cited by: §1, §2, §4.
  • A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt (2017a) BundleFusion: real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Trans. Graph. 36 (3), pp. 24:1–24:18. Cited by: §1, §2.
  • A. Dai, C. R. Qi, and M. Nießner (2017b) Shape completion using 3d-encoder-predictor cnns and shape synthesis. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 6545–6554. Cited by: §1, §2.
  • A. Dai, D. Ritchie, M. Bokeloh, S. Reed, J. Sturm, and M. Nießner (2018) ScanComplete: large-scale scene completion and semantic segmentation for 3d scans. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 4578–4587. Cited by: §1, §2.
  • L. A. Gatys, A. S. Ecker, and M. Bethge (2016) Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2414–2423. Cited by: §4.2.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §1.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §5.
  • J. Huang, A. Dai, L. J. Guibas, and M. Nießner (2017) 3Dlite: towards commodity 3d scanning for content creation. ACM Trans. Graph. 36 (6), pp. 203–1. Cited by: §2.
  • J. Huang, J. Thies, A. Dai, A. Kundu, C. M. Jiang, L. Guibas, M. Nießner, and T. Funkhouser (2020) Adversarial texture optimization from rgb-d scans. Cited by: §2.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §4.2.
  • S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. A. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. J. Davison, and A. W. Fitzgibbon (2011) KinectFusion: real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, Santa Barbara, CA, USA, October 16-19, 2011, pp. 559–568. Cited by: §1, §2.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §1.
  • W. E. Lorensen and H. E. Cline (1987) Marching cubes: A high resolution 3d surface construction algorithm. In Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1987, Anaheim, California, USA, July 27-31, 1987, pp. 163–169. Cited by: §3.
  • D. Maturana and S. Scherer (2015) VoxNet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2015, Hamburg, Germany, September 28 - October 2, 2015, pp. 922–928. Cited by: §1.
  • L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3d reconstruction in function space. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2, Table 4.
  • R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. W. Fitzgibbon (2011) KinectFusion: real-time dense surface mapping and tracking. In 10th IEEE International Symposium on Mixed and Augmented Reality, ISMAR 2011, Basel, Switzerland, October 26-29, 2011, pp. 127–136. Cited by: §1, §2.
  • M. Oechsle, L. Mescheder, M. Niemeyer, T. Strauss, and A. Geiger (2019) Texture fields: learning texture representations in function space. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4531–4540. Cited by: §B.3, §2, §5, §5, Table 1, Table 3.
  • J. J. Park, P. Florence, J. Straub, R. A. Newcombe, and S. Lovegrove (2019) DeepSDF: learning continuous signed distance functions for shape representation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pp. 165–174. Cited by: §1, §2.
  • G. Riegler, A. O. Ulusoy, and A. Geiger (2017) OctNet: learning deep 3d representations at high resolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 6620–6629. Cited by: §1.
  • S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li (2019) Pifu: pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2304–2314. Cited by: §B.3, §1, §2, §5, Table 1, Table 3, Table 4.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.2.
  • S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. A. Funkhouser (2017) Semantic scene completion from a single depth image. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 190–198. Cited by: §1, §2.
  • Y. Sun, Z. Liu, Y. Wang, and S. E. Sarma (2018) Im2Avatar: colorful 3d reconstruction from a single image. arXiv preprint arXiv:1804.06375. Cited by: §B.3, §1, §2, Table 3, Table 4.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §5.
  • T. Whelan, S. Leutenegger, R. F. Salas-Moreno, B. Glocker, and A. J. Davison (2015) ElasticFusion: dense SLAM without A pose graph. In Robotics: Science and Systems XI, Sapienza University of Rome, Rome, Italy, July 13-17, 2015. Cited by: §1, §2.
  • Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3D shapenets: A deep representation for volumetric shapes. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 1912–1920. Cited by: §2.
  • Q. Xu, W. Wang, D. Ceylan, R. Mech, and U. Neumann (2019) DISN: deep implicit surface network for high-quality single-view 3d reconstruction. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 492–502. Cited by: §2.
  • G. Yang, X. Huang, Z. Hao, M. Liu, S. Belongie, and B. Hariharan (2019) PointFlow: 3d point cloud generation with continuous normalizing flows. In The IEEE International Conference on Computer Vision (ICCV). Cited by: §2.
  • M. Zollhöfer, P. Stotko, A. Görlitz, C. Theobalt, M. Nießner, R. Klein, and A. Kolb (2018) State of the art on 3d reconstruction with rgb‐d cameras. Computer Graphics Forum 37, pp. 625–652. External Links: Document Cited by: §2.

Appendix A Network Architecture

We detail our network architecture specifications in Figure 7. Convolution parameters are given as (nf_in, nf_out, kernel_size, stride, padding). Each convolution (except those producing final outputs for geometry and color) is followed by a Leaky ReLU and batch normalization.
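To make the parameter tuple concrete, the following is a minimal sketch of how a single (nf_in, nf_out, kernel_size, stride, padding) entry determines the output grid size and weight count of a 3D convolution. It assumes cubic kernels, the same stride and padding on every axis, no dilation, and a bias term; the helper names and the example layer values are hypothetical and are not taken from Figure 7.

```python
def conv_output_size(size, kernel_size, stride, padding):
    """Extent along one axis after a convolution without dilation."""
    return (size + 2 * padding - kernel_size) // stride + 1


def describe_conv3d(spec, in_dims):
    """Return (output dims, parameter count) for one 3D convolution.

    spec is (nf_in, nf_out, kernel_size, stride, padding), following the
    notation of the appendix. A cubic kernel and a bias term are assumed.
    """
    nf_in, nf_out, kernel_size, stride, padding = spec
    out_dims = tuple(
        conv_output_size(d, kernel_size, stride, padding) for d in in_dims
    )
    n_params = nf_in * nf_out * kernel_size ** 3 + nf_out
    return out_dims, n_params


# Hypothetical layer and input grid, chosen only for illustration.
example_spec = (4, 32, 4, 2, 1)
example_dims, example_params = describe_conv3d(example_spec, (64, 64, 128))
print(example_dims, example_params)
```

Because the output size depends only on the input extent and the tuple, the same layers apply to grids of any size, which is what allows a fully-convolutional network to process scenes of varying dimensions.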

Figure 7: Network architecture specification. Given an incomplete RGB-D scan, we take its 3D geometry and color as input, and leverage a fully-convolutional neural network to predict the complete 3D model represented volumetrically for both geometry and color.
Figure 8: Qualitative comparison of our approach using CIELAB color space vs RGB color space on Matterport3D Chang et al. (2017) scans. Using CIELAB space allows us to capture more diversity in output color generation.

Appendix B Additional Results

B.1 Additional Ablation Studies

We additionally evaluate the effect of the CIELAB color space that our approach uses for color generation, in comparison to RGB space. Table 5 quantitatively evaluates the color generation, showing that CIELAB space is more effective, and Figure 8 shows that using CIELAB space allows our approach to capture a greater diversity of colors in our output predictions.
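For readers unfamiliar with the color space, here is a minimal sketch of a standard sRGB to CIELAB conversion. It assumes sRGB-encoded inputs in [0, 1] and a D65 reference white; the exact conversion and any normalization used in our training pipeline are not specified in this appendix, so the function names and constants below are illustrative only.

```python
def srgb_to_linear(c):
    """Undo the sRGB transfer curve for one channel in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def lab_f(t):
    """Piecewise cube-root function from the CIELAB definition."""
    delta = 6.0 / 29.0
    if t > delta ** 3:
        return t ** (1.0 / 3.0)
    return t / (3.0 * delta ** 2) + 4.0 / 29.0


def rgb_to_lab(r, g, b):
    """Convert an sRGB color to CIELAB (assumed D65 white point)."""
    rl, gl, bl = srgb_to_linear(r), srgb_to_linear(g), srgb_to_linear(b)
    # Linear RGB to CIE XYZ using the sRGB primaries.
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl
    # Normalize by the D65 reference white.
    fx, fy, fz = lab_f(x / 0.95047), lab_f(y / 1.0), lab_f(z / 1.08883)
    lightness = 116.0 * fy - 16.0
    a = 500.0 * (fx - fy)
    b_out = 200.0 * (fy - fz)
    return lightness, a, b_out


print(rgb_to_lab(1.0, 0.0, 0.0))
```

CIELAB separates lightness from the two chromatic axes, so errors in hue are not entangled with errors in brightness the way they are in RGB.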

| Method | SSIM (↑) | Feature-ℓ1 (↓) | FID (↓) |
| --- | --- | --- | --- |
| Using RGB | 0.702 | 0.222 | 58.8 |
| Ours | 0.709 | 0.219 | 56.03 |
Table 5: Comparison of our approach using CIELAB color space to using RGB on Matterport3D Chang et al. (2017) scans. CIELAB produces more effective color generation.
Figure 9: Additional qualitative evaluation of colored reconstruction on Matterport3D Chang et al. (2017) scans.
Figure 10: Additional qualitative evaluation of colored reconstruction on ShapeNet Chang et al. (2015) chairs.

B.2 Runtime Performance

Since our network architecture is composed of 3D convolutions, we can generate an output prediction in a single forward pass for an input scan, with runtime performance dependent on the 3D volume of the test scene as . For a small scene of size meters ( voxels), inference time is seconds; a medium scene of size meters ( voxels) takes seconds, and a large scene of size meters ( voxels) takes seconds.

B.3 Qualitative Results

We provide additional qualitative results of colored reconstruction of Matterport3D Chang et al. (2017) scans and ShapeNet Chang et al. (2015) chairs in Figures 9 and 10, respectively. Our method consistently generates sharper results than the baseline methods. Figure 9 shows the comparison to Texture Fields Oechsle et al. (2019). Since that approach does not complete geometry, we provide our predicted geometry as its input. In contrast to our method, it does not properly estimate color tones, as seen for the green chair in the bottom row of the figure. Figure 10 shows more examples from our experiments on the ShapeNet dataset in comparison to Im2Avatar Sun et al. (2018), PIFu Saito et al. (2019), and Texture Fields Oechsle et al. (2019).