Convolutional Occupancy Networks

03/10/2020 ∙ by Songyou Peng, et al.

Recently, implicit neural representations have gained popularity for learning-based 3D reconstruction. While demonstrating promising results, most implicit approaches are limited to comparably simple geometry of single objects and do not scale to more complicated or large-scale scenes. The key limiting factor of implicit methods is their simple fully-connected network architecture which does not allow for integrating local information in the observations or incorporating inductive biases such as translational equivariance. In this paper, we propose Convolutional Occupancy Networks, a more flexible implicit representation for detailed reconstruction of objects and 3D scenes. By combining convolutional encoders with implicit occupancy decoders, our model incorporates inductive biases and Manhattan-world priors, enabling structured reasoning in 3D space. We investigate the effectiveness of the proposed representation by reconstructing complex geometry from noisy point clouds and low-resolution voxel representations. We empirically find that our method enables fine-grained implicit 3D reconstruction of single objects, scales to large indoor scenes and generalizes well from synthetic to real data.




1 Introduction

(a) Occupancy Network [27] (b) Conv. Occupancy Network (c) Reconstruction on Matterport3D [1]
Figure 1: Convolutional Occupancy Networks. Traditional implicit models (Fig. 1a) are limited in their expressiveness due to their fully-connected network structure. We propose Convolutional Occupancy Networks (Fig. 1b) which exploit convolutions, resulting in scalable and equivariant implicit representations. We query the convolutional features at 3D locations using linear interpolation. In contrast to Occupancy Networks (ONet) [27], the proposed feature representation therefore depends on both the input and the 3D query location. Fig. 1c shows a reconstruction from a noisy point cloud on the Matterport3D dataset [1].

3D reconstruction is a fundamental problem in computer vision with numerous applications. An ideal representation of 3D geometry should have the following properties: a) encode complex geometries and arbitrary topologies, b) scale to large scenes, c) encapsulate local and global information, and d) be tractable in terms of memory and computation.

Unfortunately, current representations for 3D reconstruction do not satisfy all of these requirements. Volumetric representations [26] are limited in terms of resolution due to their large memory requirements. Point clouds [10] are lightweight 3D representations but discard topological relations. Mesh-based representations [15] are often hard to predict using neural networks.

Recently, several works [27, 32, 3, 28] have introduced deep implicit representations which represent 3D structures using learned occupancy or signed distance functions. In contrast to explicit representations, implicit methods do not discretize 3D space during training, thus resulting in continuous representations of 3D geometry without topology restrictions. While inspiring many follow-up works [42, 29, 31, 13, 12, 24, 30, 25], all existing approaches are limited to single objects and do not scale to larger scenes. The key limiting factor of most implicit models is their simple fully-connected network architecture [27, 32] which neither allows for integrating local information in the observations, nor for incorporating inductive biases such as translation equivariance into the model. This prevents these methods from performing structured reasoning as they only act globally and result in overly smooth reconstructions.

In contrast, translation equivariant convolutional neural networks (CNNs) have demonstrated great success across many 2D recognition tasks including object detection and image segmentation. Moreover, CNNs naturally encode information in a hierarchical manner across their network layers [50, 51]. Exploiting these inductive biases is expected to benefit not only 2D but also 3D tasks, e.g., reconstructing the 3D shapes of multiple similar chairs located in the same room. In this work, we seek to combine the complementary strengths of convolutional neural networks with those of implicit representations.

Towards this goal, we introduce Convolutional Occupancy Networks, a novel representation for accurate large-scale 3D reconstruction with continuous implicit representations (Fig. 1). We demonstrate that this representation not only preserves fine geometric details, but also enables the reconstruction of complex indoor scenes at scale. Our key idea is to establish rich input features, incorporating inductive biases and integrating local as well as global information. More specifically, we exploit the Manhattan-world assumption [7] to incorporate orientation biases (e.g., the fact that the ground is facing upwards) and convolutional operations to obtain translation equivariance (exploiting the local self-similarity of 3D structures). We systematically investigate multiple design choices, ranging from canonical planes to volumetric representations. Our contributions are summarized as follows:

  • We identify major limitations of current implicit 3D reconstruction methods.

  • We propose a flexible translation equivariant architecture which enables accurate 3D reconstruction from object to scene level.

  • We demonstrate that our model enables generalization from synthetic to real scenes as well as to novel object categories and scenes.

2 Related Work

Learning-based 3D reconstruction methods can be broadly categorized by the output representation they use.

Voxels: Voxel representations are amongst the earliest representations for learning-based 3D reconstruction [47, 5, 46]. Due to the cubic memory requirements of voxel-based representations, several works proposed to operate on multiple scales or use octrees for efficient space partitioning [16, 43, 39, 38, 26, 9]. However, even when using adaptive data structures, voxel-based techniques are still limited in terms of memory and computation.

Point Clouds: An alternative output representation for 3D reconstruction is 3D point clouds which have been used in [10, 22, 49, 35]. However, point cloud-based representations are typically limited in terms of the number of points they can handle. Furthermore, they cannot represent topological relations.

Meshes: A popular alternative is to directly regress the vertices and faces of a mesh [14, 15, 18, 44, 45, 23, 21] using a neural network. While some of these works require deforming a template mesh of fixed topology, others result in non-watertight reconstructions with self-intersecting mesh faces.

Implicit Representations: More recent implicit occupancy [27, 3] and distance field [32, 28] models use a neural network to infer an occupancy probability or distance value for any 3D point given as input. In contrast to the aforementioned explicit representations which require discretization (e.g., in terms of the number of voxels, points or vertices), implicit models represent shapes continuously and naturally handle complicated shape topologies. Implicit models have been adopted for learning implicit representations from images [25, 42, 24, 30], for encoding texture information [31], for 4D reconstruction [29] as well as for primitive-based reconstruction [13, 12, 17, 33]. Unfortunately, all these methods are limited to comparably simple 3D geometry of single objects and do not scale to more complicated or large-scale scenes. The key limiting factor is the simple fully-connected network architecture which does not allow for integrating local features or incorporating inductive biases such as translation equivariance.

Notable exceptions are PIFu [41] and DISN [48] which use pixel-aligned implicit representations to reconstruct people in clothing [41] or ShapeNet objects [48]. While these methods also exploit convolutions, all operations are performed in the 2D image domain, restricting these models to image-based inputs and reconstruction of single objects. In contrast, in this work, we propose to aggregate features in physical 3D space, exploiting both 2D and 3D convolutions. Thus, our world-centric representation is independent of the camera viewpoint and input representation. Moreover, we demonstrate for the first time the feasibility of implicit 3D reconstruction at scene level as illustrated in Fig. 1c.

In concurrent work (arXiv version published on March 3, 2020), Chibane et al. [4] present a model similar to our convolutional volume decoder. In contrast to us, they only consider a single variant of convolutional feature embeddings (3D), use lossy discretization for the 3D point cloud encoding, and focus on reconstruction of single human body shapes.

3 Method

(a) Plane Encoder (b) Volume Encoder (c) Convolutional Single-Plane Decoder (d) Convolutional Multi-Plane Decoder (e) Convolutional Volume Decoder
Figure 2: Model Overview. The encoder (left) first converts the 3D input (e.g., noisy point clouds or coarse voxel grids) into features using task-specific neural networks. Next, the features are projected onto one or multiple planes (Fig. 2a) or into a volume (Fig. 2b) using average pooling. The convolutional decoder (right) processes the resulting feature planes/volume using 2D/3D U-Nets to aggregate local and global information. For a query point, the point-wise feature vector is obtained via bilinear (Fig. 2c and Fig. 2d) or trilinear (Fig. 2e) interpolation. Given the feature vector at the query location, the occupancy probability is predicted using a fully-connected network.

Our goal is to make implicit 3D representations more expressive. An overview of our model is provided in Fig. 2. We first encode the input (e.g., a point cloud) into a 2D or 3D feature grid (left). These features are processed using convolutional networks and decoded into occupancy probabilities via a fully-connected network. We investigate planar representations (Fig. 2a, 2c, 2d), volumetric representations (Fig. 2b, 2e) as well as combinations thereof in our experiments. In the following, we explain the encoder (Section 3.1), the decoder (Section 3.2), the occupancy prediction (Section 3.3) and the training procedure (Section 3.4) in more detail.

3.1 Encoder

While our method is independent of the input representation, we focus on 3D inputs to demonstrate the ability of our model to recover fine details and scale to large scenes. More specifically, we assume a noisy sparse point cloud (e.g., from structure-from-motion or laser scans) or a coarse occupancy grid as input.

We first process the input with a task-specific neural network to obtain a feature encoding for every point or voxel. We use a one-layer 3D CNN for voxelized inputs, and a shallow PointNet [36] with local pooling for 3D point clouds. Given these features, we construct planar and volumetric feature representations in order to encapsulate local neighborhood information as follows.

Plane Encoder: As illustrated in Fig. 2a, for each input point, we perform an orthographic projection onto a canonical plane (i.e., a plane aligned with the axes of the coordinate frame) which we discretize at a resolution of H × W pixel cells. For voxel inputs, we treat the voxel center as a point and project it onto the plane. We aggregate features projecting onto the same pixel using average pooling, resulting in planar features of dimensionality H × W × d, where d is the feature dimension.

In our experiments, we analyze two variants of our model: one variant where features are projected onto the ground plane, and one variant where features are projected onto all three canonical planes. While the former is computationally more efficient, the latter allows for recovering richer geometric structure in the z dimension. Furthermore, by aligning the world coordinate frame with the dominant surface orientations of the scene, we effectively incorporate a Manhattan-world prior [7, 11] into our model.
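As an illustrative sketch of this projection-and-pooling step (the shapes, function names, and NumPy backend here are our own simplifications, not the paper's implementation, which feeds PointNet features through this scatter operation):

```python
import numpy as np

def pool_to_plane(points, feats, res=64, axes=(0, 1)):
    """Scatter point features onto a canonical plane with average pooling.

    points: (N, 3) array with coordinates in [0, 1); feats: (N, d) per-point
    features. `axes` selects the two coordinates kept by the orthographic
    projection (e.g., (0, 1) drops z, projecting onto the ground plane).
    Returns a (res, res, d) feature plane.
    """
    d = feats.shape[1]
    # Orthographic projection: discretize the two kept coordinates.
    ij = np.clip((points[:, list(axes)] * res).astype(int), 0, res - 1)
    idx = ij[:, 0] * res + ij[:, 1]              # flattened pixel index
    plane = np.zeros((res * res, d))
    count = np.zeros(res * res)
    np.add.at(plane, idx, feats)                 # sum features per pixel
    np.add.at(count, idx, 1.0)
    plane /= np.maximum(count, 1.0)[:, None]     # average pooling
    return plane.reshape(res, res, d)
```

`np.add.at` handles the many-points-to-one-pixel case correctly (unbuffered accumulation), which is what makes the average pooling work when several points project onto the same cell.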

Volume Encoder: While planar feature representations allow for encoding at large spatial resolutions, they are restricted to two dimensions. Therefore, we also consider volumetric encodings (see Fig. 2b) which better represent 3D information, but are restricted to smaller resolutions (typically 32³ voxels in our experiments). Similar to the plane encoder, we perform average pooling, but this time over all features falling into the same voxel cell, resulting in a feature volume of dimensionality H × W × D × d.
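The volumetric analogue of the plane pooling is a sketch away (again with hypothetical names and shapes, assuming coordinates normalized to [0, 1)):

```python
import numpy as np

def pool_to_volume(points, feats, res=32):
    """Average-pool per-point features into a res^3 feature volume.

    points: (N, 3) in [0, 1)^3; feats: (N, d).
    Returns a (res, res, res, d) feature volume.
    """
    d = feats.shape[1]
    ijk = np.clip((points * res).astype(int), 0, res - 1)
    idx = (ijk[:, 0] * res + ijk[:, 1]) * res + ijk[:, 2]  # flat voxel index
    vol = np.zeros((res ** 3, d))
    count = np.zeros(res ** 3)
    np.add.at(vol, idx, feats)                  # sum features per voxel
    np.add.at(count, idx, 1.0)
    vol /= np.maximum(count, 1.0)[:, None]      # average pooling
    return vol.reshape(res, res, res, d)
```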

3.2 Decoder

We endow our model with translation equivariance by processing the feature planes and the feature volume from the encoder using 2D and 3D convolutional hourglass (U-Net) networks [40, 6] which are composed of a series of down- and upsampling convolutions with skip connections to integrate both local and global information. We choose the depth of the U-Net such that the receptive field becomes equal to the size of the respective feature plane or volume.

Our single-plane decoder (Fig. 2c) processes the ground plane features with a 2D U-Net. The multi-plane decoder (Fig. 2d) processes each feature plane separately using 2D U-Nets with shared weights. Our volume decoder (Fig. 2e) uses a 3D U-Net. Since convolution operations are translation equivariant, our output features are also translation equivariant, enabling structured reasoning. Moreover, convolutional operations are able to “inpaint” features while preserving global information, enabling reconstruction from sparse inputs.
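Under the simplifying assumption that each U-Net level halves the spatial resolution, the depth needed for the bottleneck's receptive field to cover the whole feature plane or volume is roughly the base-2 logarithm of the resolution (a back-of-the-envelope model, not the paper's exact architecture):

```python
import math

def unet_depth(resolution):
    """Number of down-sampling levels needed so the bottleneck 'sees' the
    entire feature plane/volume, assuming each level halves the resolution."""
    return int(math.ceil(math.log2(resolution)))
```

For example, a 64² feature plane would need 6 levels under this model, and a 128² plane 7.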

3.3 Occupancy Prediction

Given the aggregated feature maps, our goal is to estimate the occupancy probability of any point p in 3D space. For the single-plane decoder, we project each point orthographically onto the ground plane and query the feature value through bilinear interpolation (Fig. 2c). For the multi-plane decoder (Fig. 2d), we aggregate information from the three canonical planes by summing the features of all three planes. For the volume decoder, we use trilinear interpolation (Fig. 2e).
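A minimal NumPy sketch of the bilinear plane query and the multi-plane summation described above (the helper names are ours; the actual implementation uses PyTorch's differentiable sampling):

```python
import numpy as np

def bilinear_sample(plane, uv):
    """Bilinearly interpolate an (R, R, d) feature plane at uv in [0, 1]^2."""
    R = plane.shape[0]
    x = np.clip(uv[0] * (R - 1), 0, R - 1)
    y = np.clip(uv[1] * (R - 1), 0, R - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, R - 1), min(y0 + 1, R - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[x0, y0]
            + wx * (1 - wy) * plane[x1, y0]
            + (1 - wx) * wy * plane[x0, y1]
            + wx * wy * plane[x1, y1])

def query_multiplane(planes, p):
    """Multi-plane feature for a 3D point p: project p onto the three
    canonical planes (xy, xz, yz) and sum the interpolated features."""
    projections = [(0, 1), (0, 2), (1, 2)]
    return sum(bilinear_sample(pl, p[list(ax)])
               for pl, ax in zip(planes, projections))
```

The trilinear volume query follows the same pattern with eight corner weights instead of four.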

Denoting the feature vector for input x at point p as ψ(p, x), we predict the occupancy of p using a small fully-connected occupancy network:

  f_θ(p, ψ(p, x)) → [0, 1]
The network comprises multiple ResNet blocks. We use the network architecture of [30], adding the interpolated feature vector to the input features of every ResNet block instead of the more memory intensive batch normalization operation proposed in earlier works [27]. In contrast to [30], we use a feature dimension of 32 for the hidden layers. Details about the network architecture can be found in the supplementary.
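The conditioning scheme described above, where the interpolated feature vector is injected before every residual block, can be caricatured with a toy head (all names, sizes, and weights here are hypothetical stand-ins for the learned network):

```python
import numpy as np

def occupancy_head(p, feat, weights, n_blocks=2):
    """Toy fully-connected occupancy head with per-block feature conditioning.

    p: (3,) query point; feat: interpolated feature vector; weights: dict of
    hypothetical parameter arrays. The feature is added to the hidden
    activation before every residual block, replacing batch-norm conditioning.
    """
    h = weights["fc_in"] @ p                                  # lift point to hidden dim
    for i in range(n_blocks):
        x = np.maximum(h + weights[f"cond_{i}"] @ feat, 0.0)  # condition + ReLU
        h = h + weights[f"res_{i}"] @ x                       # residual update
    logit = weights["fc_out"] @ h
    return 1.0 / (1.0 + np.exp(-logit))                       # occupancy probability
```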

3.4 Training and Inference

At training time, we uniformly sample query points p within the volume of interest and predict their occupancy values. We apply the binary cross-entropy loss between the predicted occupancy ô = f_θ(p, ψ(p, x)) and the true occupancy o:

  L(ô, o) = −[o log(ô) + (1 − o) log(1 − ô)]
We implement all models in PyTorch [34] and use the Adam optimizer [20] with a learning rate of 10⁻⁴. During inference, we apply Multiresolution IsoSurface Extraction (MISE) [27] to extract meshes given an input. As our model is fully convolutional, we are able to reconstruct large scenes by applying it in a “sliding-window” fashion at inference time. We exploit this property to obtain reconstructions of entire apartments (see Fig. 1).
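The binary cross-entropy objective from Section 3.4 can be sketched as follows (a NumPy stand-in for the PyTorch loss; the `eps` clipping is our addition for numerical stability):

```python
import numpy as np

def occupancy_bce(pred, target, eps=1e-7):
    """Binary cross-entropy between predicted occupancy probabilities and
    ground-truth occupancies, averaged over the sampled query points."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(target * np.log(pred)
                           + (1.0 - target) * np.log(1.0 - pred))))
```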

4 Experiments

We conduct three types of experiments to evaluate our method. First, we perform object-level reconstruction on ShapeNet [2] chairs, considering noisy point clouds and low-resolution occupancy grids as inputs. Next, we compare our approach against several baselines on the task of scene-level reconstruction using a synthetic indoor dataset of various objects. Finally, we demonstrate synthetic-to-real generalization by evaluating our model on real indoor scenes [8, 1].


ShapeNet [2]: We use the ShapeNet subset, voxelizations, and train/val/test split from Choy et al. [5]. We focus on the “chair” category due to its large intra-class variety. Results for other classes can be found in the supplementary.

Synthetic Indoor Scene Dataset: We create a synthetic dataset of 5000 scenes with multiple objects from ShapeNet (chair, sofa, lamp, cabinet, table). A scene consists of a ground plane with randomly sampled width-length ratio, multiple objects with random rotation and scale, and randomly sampled walls.

ScanNet v2 [8]: This dataset contains 1513 real-world rooms captured with an RGB-D camera. We sample point clouds from the provided meshes for testing.

Matterport3D [1]: Matterport3D contains 90 buildings with multiple rooms on different floors captured using a Matterport Pro Camera. Similar to ScanNet, we sample point clouds for evaluating our model on Matterport3D.


ONet [27]: Occupancy Networks is a state-of-the-art implicit 3D reconstruction model. It uses a fully-connected network architecture and a global encoding of the input. We compare against this method in all of our experiments.

PointConv: We construct another simple baseline by extracting point-wise features using PointNet++ [37], interpolating them using Gaussian kernel regression and feeding them into the same fully-connected network used in our approach. While this baseline uses local information, it does not exploit convolutions.

SPSR [19]: Screened Poisson Surface Reconstruction (SPSR) is a traditional 3D reconstruction technique which operates on oriented point clouds as input. Note that in contrast to all other methods, SPSR requires additional surface normals which are often hard to obtain for real-world scenarios.


Following [27], we consider Volumetric IoU, Chamfer Distance as well as Normal Consistency for evaluation. Details can be found in the supplementary.
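For illustration, minimal NumPy versions of two of these metrics might look as follows (one common definition each; the paper's exact evaluation protocol is given in its supplementary, and the brute-force nearest-neighbour search is only practical for small point sets):

```python
import numpy as np

def volumetric_iou(occ_a, occ_b):
    """IoU between two boolean occupancy grids of the same shape."""
    inter = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    return inter / union if union > 0 else 1.0

def chamfer_l1(pts_a, pts_b):
    """Symmetric Chamfer distance between point sets (N, 3) and (M, 3):
    mean nearest-neighbour distance in both directions, averaged."""
    dists = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return 0.5 * (dists.min(axis=1).mean() + dists.min(axis=0).mean())
```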

4.1 Object-Level Reconstruction

We first evaluate our method on the single-object reconstruction task. We consider the ShapeNet [2] “chair” class since it is challenging due to its large intra-class variety and level of detail [27]. We consider two different types of 3D inputs: noisy point clouds and low-resolution voxels. For the former, we sample points from the mesh and apply zero-mean Gaussian noise. For the latter, we use the coarse voxelizations from [27].

Reconstruction from Point Clouds: Table 1 and Fig. 3 show quantitative and qualitative results. Compared to the baselines, all variants of our method achieve equal or better results on all three metrics. As evidenced by the training progression plot on the right, our method reaches a high validation IoU after only a few iterations. This verifies our hypothesis that leveraging convolutions and local features benefits 3D reconstruction in terms of both accuracy and efficiency. The results show that, in comparison to PointConv, which directly aggregates features from point clouds, projecting point features onto planes or volumes followed by 2D/3D CNNs is more effective. In addition, decomposing the 3D representation from a volume into three planes at higher resolution improves performance while at the same time requiring less GPU memory.

GPU Memory IoU Chamfer-L1 Normal C.
ONet [27] 7.7G 0.721 0.097 0.884
PointConv 5.1G 0.745 0.085 0.898
Ours-2D () 1.6G 0.780 0.070 0.898
Ours-2D () 2.4G 0.861 0.048 0.937
Ours-3D () 5.9G 0.842 0.052 0.937
Table 1: Object-Level 3D Reconstruction from Point Clouds. Left: We report GPU memory, IoU, Chamfer-L1 distance and Normal Consistency for our approach (2D plane and 3D voxel grid dimensions in brackets) and the baselines ONet [27] and PointConv on the ShapeNet “chair” category. Right: The training progression plot shows that our method converges faster than the baselines.
Figure 3: Object-Level 3D Reconstruction from Point Clouds. We compare our representation (2D and 3D variants) to ONet [27] and PointConv on the ShapeNet “chair” category, alongside the input and the GT mesh.

Voxel Super-Resolution:

Besides noisy point clouds, we also evaluate on the task of voxel super-resolution. Here, the goal is to recover high-resolution details from coarse 32³ voxelizations of the shape. Table 2 and Fig. 4 show that our volumetric method outperforms our plane-based variants as well as the baselines on this task. However, note that our method with three planes achieves comparable results while requiring only a fraction of the GPU memory of our volumetric representation. In contrast to reconstruction from point clouds, our single-plane approach fails on this task. We hypothesize that a single plane is not sufficient for resolving ambiguities in the coarse but regularly structured voxel input.

GPU Memory IoU Chamfer-L1 Normal Consistency
Input - 0.622 0.130 0.798
ONet [27] 4.8G 0.670 0.115 0.876
Ours-2D () 2.4G 0.591 0.152 0.853
Ours-2D () 4.0G 0.738 0.084 0.907
Ours-3D () 10.8G 0.746 0.081 0.917
Table 2: Voxel Super-Resolution. 3D reconstruction results from low-resolution voxelized inputs (32³ voxels) on the “chair” category of the ShapeNet dataset.
Figure 4: Voxel Super-Resolution. Qualitative comparison between our method and ONet [27] using coarse voxelized inputs at a resolution of 32³ voxels.

Generalization: In contrast to the baselines, our method degrades gracefully when evaluated on an object class different from the training category (see Fig. 5). This emphasizes the importance of equivariant representations and geometric reasoning using both local and global features.

Figure 5: Generalization (Chair → Table). We analyze the generalization performance of our method (2D and 3D variants) and the baselines (ONet [27], PointConv) by training them on the ShapeNet “chair” category and evaluating them on the “table” category.

4.2 Scene-Level Reconstruction

To analyze whether our approach scales to larger scenes, we now reconstruct 3D geometry from point clouds on our synthetic indoor scene dataset. Due to the increased complexity of the scenes, we uniformly sample points and apply zero-mean Gaussian noise. For our plane-based methods, we use a higher plane resolution than in the object-level experiments. For our volumetric approach, we investigate both 32³ and 64³ resolutions. Hypothesizing that the plane and volumetric features are complementary, we also test the combination of the multi-plane and volumetric variants.

Table 3 and Fig. 6 show our results. All variants of our method are able to reconstruct geometric details of the scenes and lead to smooth results. In contrast, ONet and PointConv suffer from low accuracy while SPSR leads to noisy surfaces. While high-resolution canonical plane features capture fine details, they are prone to noise. Low-resolution volumetric features are instead more robust to noise, yet produce smoother surfaces. Combining complementary volumetric and plane features improves results compared to considering them in isolation. This confirms our hypothesis that plane-based and volumetric features are complementary. However, the best results in this setting are achieved when increasing the resolution of the volumetric features to 64³.


Figure 6: Scene-Level Reconstruction on Synthetic Rooms. Qualitative comparison of ONet [27], SPSR [19], our variants, and the GT mesh for point-cloud based reconstruction on the synthetic indoor scene dataset.
IoU Chamfer-L1 Normal Consistency
ONet [27] 0.475 0.210 0.783
PointConv 0.523 0.168 0.811
SPSR [19] - 0.223 0.866
SPSR [19] (trimmed) - 0.069 0.890
Ours-2D () 0.793 0.062 0.903
Ours-2D () 0.805 0.057 0.913
Ours-3D () 0.782 0.060 0.911
Ours-3D () 0.849 0.057 0.922
Ours-2D-3D () 0.816 0.057 0.915
Table 3: Scene-Level Reconstruction on Synthetic Rooms. Quantitative comparison for reconstruction from noisy point clouds. We do not report IoU for SPSR as SPSR generates only a single surface for walls and the ground plane.

4.3 Ablation Study

In this section, we use our synthetic indoor scene dataset to investigate different feature aggregation strategies at similar GPU memory consumption, as well as different feature interpolation strategies.

Performance at Similar GPU Memory: Table 4a shows a comparison of different feature aggregation strategies at similar GPU memory utilization. Our multi-plane approach slightly outperforms the single-plane and the volumetric approach in this setting. Moreover, increasing the plane resolution for the single-plane variant does not result in a clear performance boost, demonstrating that higher resolution does not necessarily guarantee better performance.

GPU Memory IoU Chamfer-L1 Normal C.
Ours-2D () 9.5GB 0.773 0.062 0.905
Ours-2D () 9.3GB 0.805 0.057 0.913
Ours-3D () 8.5GB 0.782 0.060 0.911
(a) Performance at similar GPU Memory
IoU Chamfer-L1 Normal C.
Nearest Neighbor 0.766 0.066 0.903
Bilinear 0.805 0.057 0.913
(b) Interpolation Strategy
Table 4: Ablation Study on Synthetic Rooms. We compare the performance of different feature aggregation strategies at similar GPU memory in Table 4a and evaluate two different interpolation strategies in Table 4b.

Feature Interpolation Strategy: To analyze the effect of the feature interpolation strategy in the convolutional decoder of our method, we compare nearest neighbor and bilinear interpolation for our multi-plane variant. The results in Table 4b clearly demonstrate the benefit of bilinear interpolation.

4.4 Reconstruction from Point Clouds on Real-World Datasets

Next, we investigate the generalization capabilities of our method. Towards this goal, we evaluate our models trained on the synthetic indoor scene dataset on the real-world datasets ScanNet v2 [8] and Matterport3D [1]. Similar to our previous experiments, we use points sampled from the meshes as input.

ScanNet v2: Our results in Table 5 show that among all our variants, the volumetric-based models perform best, indicating that the plane-based approaches are more affected by the domain shift. We find that 3D CNNs are more robust to noise as they aggregate features from all neighbors which results in smooth outputs. Moreover, all variants outperform the learning-based baselines by a significant margin.

Chamfer-L1
ONet [27] 0.398
PointConv 0.316
SPSR [19] 0.293
SPSR [19] (trimmed) 0.086
Ours-2D () 0.139
Ours-2D () 0.141
Ours-3D () 0.095
Ours-3D () 0.079
Ours-2D-3D () 0.099
Table 5: Scene-Level Reconstruction on ScanNet. Evaluation of point-based reconstruction on the real-world ScanNet dataset. As ScanNet does not provide watertight meshes, we trained all methods on the synthetic indoor scene dataset.

The qualitative comparison in Fig. 7 shows that our model is able to smoothly reconstruct scenes with geometric details at various scales. While Screened PSR [19] also produces reasonable reconstructions, it tends to close the resulting meshes and hence requires a carefully chosen trimming parameter. In contrast, our method does not require additional hyperparameters.


Figure 7: Scene-Level Reconstruction on ScanNet. Qualitative results of ONet [27], SPSR [19], and our method for point-based reconstruction on ScanNet [8]. All learning-based methods are trained on the synthetic room dataset and evaluated on ScanNet.

Matterport3D Dataset: Finally, we investigate whether our approach scales to larger scenes comprising multiple rooms on multiple floors. For this experiment, we use the Matterport3D dataset which provides a segmentation into rooms. We sample points on the surface of each room and apply our convolutional occupancy network separately to each room in order to cope with the limited GPU memory. Fig. 1 shows the resulting 3D reconstruction. Our method reconstructs details inside each room while adhering to the room layout. Note that the geometry and point distribution of the Matterport3D dataset differ significantly from the synthetic indoor scene dataset which our model is trained on. This demonstrates that our method generalizes not only to unseen classes, but also to novel room layouts and sensor characteristics.

5 Conclusion

We introduced Convolutional Occupancy Networks, a novel shape representation which combines the expressiveness of convolutional neural networks with the advantages of implicit representations. We analyzed the tradeoffs between 2D and 3D feature representations and found that incorporating convolutional operations facilitates generalization to unseen classes, novel room layouts and large-scale indoor spaces. While the focus of this work was on learning-based 3D reconstruction, in future work, we plan to apply our novel representation to other domains such as implicit appearance modeling and 4D reconstruction.

Acknowledgement. This work was supported by an NVIDIA research gift. The authors thank Max Planck ETH Center for Learning Systems (CLS) for supporting Songyou Peng and the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Michael Niemeyer.


  • [1] Chang, A., Dai, A., Funkhouser, T., Halber, M., Niessner, M., Savva, M., Song, S., Zeng, A., Zhang, Y.: Matterport3D: Learning from RGB-D data in indoor environments. In: Proc. of the International Conf. on 3D Vision (3DV) (2017)
  • [2] Chang, A.X., Funkhouser, T.A., Guibas, L.J., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., Yu, F.: Shapenet: An information-rich 3d model repository. arXiv:1512.03012 (2015)
  • [3] Chen, Z., Zhang, H.: Learning implicit fields for generative shape modeling. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2019)

  • [4] Chibane, J., Alldieck, T., Pons-Moll, G.: Implicit functions in feature space for 3d shape reconstruction and completion. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2020)
  • [5] Choy, C.B., Xu, D., Gwak, J., Chen, K., Savarese, S.: 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In: Proc. of the European Conf. on Computer Vision (ECCV) (2016)
  • [6] Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3d u-net: Learning dense volumetric segmentation from sparse annotation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2016)
  • [7] Coughlan, J.M., Yuille, A.L.: Manhattan world: Compass direction from a single image by bayesian inference. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) (1999)

  • [8] Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Niessner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [9] Dai, A., Qi, C.R., Nießner, M.: Shape completion using 3d-encoder-predictor cnns and shape synthesis. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [10] Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3d object reconstruction from a single image. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [11] Furukawa, Y., Curless, B., Seitz, S.M., Szeliski, R.: Manhattan-world stereo. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2009)
  • [12] Genova, K., Cole, F., Sud, A., Sarna, A., Funkhouser, T.A.: Deep structured implicit functions. arXiv:1912.06126 (2019)
  • [13] Genova, K., Cole, F., Vlasic, D., Sarna, A., Freeman, W.T., Funkhouser, T.: Learning shape templates with structured implicit functions. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) (2019)
  • [14] Gkioxari, G., Malik, J., Johnson, J.: Mesh R-CNN. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) (2019)
  • [15] Groueix, T., Fisher, M., Kim, V.G., Russell, B.C., Aubry, M.: AtlasNet: A papier-mâché approach to learning 3d surface generation. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2018)
  • [16] Hane, C., Tulsiani, S., Malik, J.: Hierarchical surface prediction for 3d object reconstruction. In: Proc. of the International Conf. on 3D Vision (3DV) (2017)
  • [17] Jeruzalski, T., Deng, B., Norouzi, M., Lewis, J.P., Hinton, G.E., Tagliasacchi, A.: NASA: neural articulated shape approximation. arXiv:1912.03207 (2019)
  • [18] Kanazawa, A., Tulsiani, S., Efros, A.A., Malik, J.: Learning category-specific mesh reconstruction from image collections. In: Proc. of the European Conf. on Computer Vision (ECCV) (2018)
  • [19] Kazhdan, M.M., Hoppe, H.: Screened poisson surface reconstruction. ACM Trans. on Graphics 32(3),  29 (2013)
  • [20] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Proc. of the International Conf. on Learning Representations (ICLR) (2015)
  • [21] Liao, Y., Donne, S., Geiger, A.: Deep marching cubes: Learning explicit surface representations. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2018)
  • [22] Lin, C., Kong, C., Lucey, S.: Learning efficient point cloud generation for dense 3d object reconstruction. In: Proc. of the Conf. on Artificial Intelligence (AAAI) (2018)
  • [23] Lin, C., Wang, O., Russell, B.C., Shechtman, E., Kim, V.G., Fisher, M., Lucey, S.: Photometric mesh optimization for video-aligned 3d object reconstruction. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2019)
  • [24] Liu, S., Zhang, Y., Peng, S., Shi, B., Pollefeys, M., Cui, Z.: DIST: rendering deep implicit signed distance function with differentiable sphere tracing. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2020)
  • [25] Liu, S., Saito, S., Chen, W., Li, H.: Learning to infer implicit surfaces without 3d supervision. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)
  • [26] Maturana, D., Scherer, S.: Voxnet: A 3d convolutional neural network for real-time object recognition. In: Proc. IEEE International Conf. on Intelligent Robots and Systems (IROS) (2015)
  • [27] Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., Geiger, A.: Occupancy networks: Learning 3d reconstruction in function space. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2019)
  • [28] Michalkiewicz, M., Pontes, J.K., Jack, D., Baktashmotlagh, M., Eriksson, A.: Implicit surface representations as layers in neural networks. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) (2019)
  • [29] Niemeyer, M., Mescheder, L., Oechsle, M., Geiger, A.: Occupancy flow: 4d reconstruction by learning particle dynamics. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) (2019)
  • [30] Niemeyer, M., Mescheder, L.M., Oechsle, M., Geiger, A.: Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2020)
  • [31] Oechsle, M., Mescheder, L., Niemeyer, M., Strauss, T., Geiger, A.: Texture fields: Learning texture representations in function space. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) (2019)
  • [32] Park, J.J., Florence, P., Straub, J., Newcombe, R.A., Lovegrove, S.: Deepsdf: Learning continuous signed distance functions for shape representation. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2019)
  • [33] Paschalidou, D., van Gool, L., Geiger, A.: Learning unsupervised hierarchical part decomposition of 3d objects from a single rgb image. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2020)
  • [34] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)
  • [35] Prokudin, S., Lassner, C., Romero, J.: Efficient learning on point clouds with basis point sets. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) (2019)
  • [36] Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [37] Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems (NeurIPS) (2017)
  • [38] Riegler, G., Ulusoy, A.O., Bischof, H., Geiger, A.: OctNetFusion: Learning depth fusion from data. In: Proc. of the International Conf. on 3D Vision (3DV) (2017)
  • [39] Riegler, G., Ulusoy, A.O., Geiger, A.: Octnet: Learning deep 3d representations at high resolutions. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2017)
  • [40] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2015)
  • [41] Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., Li, H.: Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) (2019)
  • [42] Sitzmann, V., Zollhöfer, M., Wetzstein, G.: Scene representation networks: Continuous 3d-structure-aware neural scene representations. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)
  • [43] Tatarchenko, M., Dosovitskiy, A., Brox, T.: Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) (2017)
  • [44] Wang, N., Zhang, Y., Li, Z., Fu, Y., Liu, W., Jiang, Y.G.: Pixel2mesh: Generating 3d mesh models from single rgb images. In: Proc. of the European Conf. on Computer Vision (ECCV) (2018)
  • [45] Wen, C., Zhang, Y., Li, Z., Fu, Y.: Pixel2mesh++: Multi-view 3d mesh generation via deformation. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) (2019)
  • [46] Wu, J., Zhang, C., Xue, T., Freeman, B., Tenenbaum, J.: Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In: Advances in Neural Information Processing Systems (NeurIPS) (2016)
  • [47] Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., Xiao, J.: 3d shapenets: A deep representation for volumetric shapes. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2015)
  • [48] Xu, Q., Wang, W., Ceylan, D., Mech, R., Neumann, U.: DISN: deep implicit surface network for high-quality single-view 3d reconstruction. In: Advances in Neural Information Processing Systems (NeurIPS) (2019)
  • [49] Yang, G., Huang, X., Hao, Z., Liu, M., Belongie, S.J., Hariharan, B.: Pointflow: 3d point cloud generation with continuous normalizing flows. In: Proc. of the IEEE International Conf. on Computer Vision (ICCV) (2019)
  • [50] Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Proc. of the European Conf. on Computer Vision (ECCV) (2014)
  • [51] Zhang, Q., Wu, Y.N., Zhu, S.: Interpretable convolutional neural networks. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2018)