
PatchComplete: Learning Multi-Resolution Patch Priors for 3D Shape Completion on Unseen Categories

While 3D shape representations enable powerful reasoning in many visual and perception applications, learning 3D shape priors tends to be constrained to the specific categories trained on, leading to an inefficient learning process, particularly for general applications with unseen categories. Thus, we propose PatchComplete, which learns effective shape priors based on multi-resolution local patches, which are often more general than full shapes (e.g., chairs and tables often both share legs) and thus enable geometric reasoning about unseen class categories. To learn these shared substructures, we learn multi-resolution patch priors across all train categories, which are then associated to input partial shape observations by attention across the patch priors, and finally decoded into a complete shape reconstruction. Such patch-based priors avoid overfitting to specific train categories and enable reconstruction on entirely unseen categories at test time. We demonstrate the effectiveness of our approach on synthetic ShapeNet data as well as challenging real-scanned objects from ScanNet, which include noise and clutter, improving over state of the art in novel-category shape completion by 19.3% in Chamfer Distance on ShapeNet and 9.0% on ScanNet.


1 Introduction

Figure 1: PatchComplete learns strong local priors for 3D shape completion by constructing multi-resolution patch priors from train shapes, which can then be applied for effective shape completion on unseen categories.

The prevalence of commodity RGB-D sensors (e.g., Intel RealSense, Microsoft Kinect, iPhone, etc.) has enabled significant progress in 3D reconstruction, achieving impressive tracking quality Newcombe et al. (2011); Izadi et al. (2011); Nießner et al. (2013); Choi et al. (2015); Whelan et al. (2015); Dai et al. (2017b) and even large-scale reconstructed datasets Dai et al. (2017a); Chang et al. (2017). Unfortunately, 3D scanned reconstructions remain limited in geometric quality due to clutter, noise, and incompleteness (e.g., as seen in the objects in Figure 1). Understanding complete object structures is fundamental towards constructing effective 3D representations that can then be used to fuel many applications in robotic perception, mixed reality, content creation, and more.

Recently, significant progress has been made in shape reconstruction, across a variety of 3D representations, including voxels Choy et al. (2016); Dai et al. (2017c); Tatarchenko et al. (2017), points Fan et al. (2017); Yang et al. (2019), meshes Wang et al. (2018); Dai and Nießner (2019), and implicit field representations Park et al. (2019); Mescheder et al. (2019). However, these methods tend to rely heavily on strong synthetic supervision, producing impressive reconstructions on train class categories but struggling to generalize to unseen categories. This imposes expensive compute and data requirements for adapting to new objects in different scenarios, whose class categories may not have been seen during training and which must then be re-trained or fine-tuned for.

In order to encourage more generalizable 3D feature learning to represent shape characteristics, we observe that while different class categories may have very different global structures, local geometric structures are often shared (e.g., a long, thin structure could represent a chair leg, a table leg, a lamp rod, etc.). We thus propose to learn a set of multi-resolution patch-based priors that captures such shared local substructures across the training set of shapes, which can be applied to shapes outside of the train set of categories. Our local patch-based priors can thus capture shared local structures, across different resolutions, that enable effective shape completion on novel class categories of not only synthetic data but also challenging real-world observations with noise and clutter.

We propose PatchComplete, which learns patch priors for shape completion by correlating regions of observed partial inputs with the learned patch priors through attention-based association, and decoding to reconstruct a complete shape. These patch priors are learned at different resolutions to encompass potentially different sizes of local substructures; we then learn to fuse the multi-resolution priors together to reconstruct the output complete shape. This enables learning generalizable local 3D priors that facilitate effective shape completion even for unseen categories, outperforming state of the art on synthetic and real-world observations by 19.3% and 9.0% in Chamfer Distance, respectively.

In summary, our contributions are:


  • We propose generalizable 3D shape priors by learning patch-based priors that characterize shared local substructures that can be associated with input observations by cross-attention. This intermediate representation preserves structure explicitly, and can be effectively leveraged to compose complete shapes for entirely unseen categories.

  • We design a multi-resolution fusion of different patch priors at various resolutions in order to effectively reconstruct a complete shape, enabling multi-resolution reasoning about the most informative learned patch priors to recover both global and local shape structures.

2 Related Work

2.1 3D Shape Reconstruction and Completion

Understanding how to reconstruct 3D shapes is an essential task for 3D machine perception. In particular, the task of shape completion to predict a complete shape from partial input observations has been studied by various works toward understanding 3D shape structures. Recently, many works have leveraged deep learning techniques to learn strong data-driven priors for shape completion, focusing on various representations, e.g., volumetric grids Wu et al. (2015); Dai et al. (2017c), continuous implicit functions Peng et al. (2020); Mescheder et al. (2019); Park et al. (2019); Chibane et al. (2020); Tretschk et al. (2021), point clouds Stutz and Geiger (2020); Tang et al. (2021), and meshes Dai and Nießner (2019); Li et al. (2020). These works tend to focus on strong synthetic supervision on a small set of train categories, achieving impressive performance on unseen shapes from train categories, but often struggling to generalize to unseen classes. We focus on learning more generalizable, local shape priors in order to effectively reconstruct complete shapes on unseen class categories.

2.2 Few-Shot and Zero-Shot 3D Shape Reconstruction

Several works have been developed to tackle the challenging task of few-shot or zero-shot shape reconstruction, as observations in-the-wild often contain a wide diversity of objects. In the few-shot scenario where several examples of novel categories are available, Wallace and Hariharan (2019) learn to refine a given category prior. Michalkiewicz et al. (2021) further propose to learn compositional shape priors for single-view reconstruction.

In the zero-shot scenario without any examples seen for novel categories, Naeem et al. (2021) learn priors from seen categories to generate segmentation masks for unseen categories. Zhang et al. (2018) additionally proposed to use spherical map representations to learn priors for the reconstruction of novel categories. Thai et al. (2021) recently developed an approach to transfer knowledge from an RGB image for shape reconstruction. We also tackle a zero-shot shape reconstruction task, by learning a multi-resolution set of strong local shape priors to compose a reconstructed shape.

Several recent works have explored learning shape priors by leveraging a VQ-VAE backbone with autoregressive prediction to perform shape reconstruction Mittal et al. (2022); Yan et al. (2022). In contrast, we propose to learn multi-resolution shape priors without requiring any sequence interpretation, which enables direct applicability to real-world scan data that often contains noise and clutter. Additionally, hierarchical reconstruction has shown promising results for shape reconstruction Bechtold et al. (2021); Chen et al. (2021). Our approach also takes a multi-resolution perspective, but learns explicit shape and local patch priors and their correlation to input partial observations for robust shape completion.

3 Method

3.1 Overview

Figure 2: Overview of our approach. (a) shows our multi-resolution prior learning for shape completion. Each dotted block indicates patch prior learning for a single resolution, and the three different resolution encodings are then fused in a multi-resolution decoder that outputs a complete shape as an SDF grid. (b) illustrates our input partial observations and local input patches, from which we learn mappings to learned patch priors that indicate how to best compose the complete shape.

Our method aims to learn effective 3D shape priors that enable general shape completion from partial input scan data, on novel class categories not seen during training. Key to our approach is the observation that 3D shapes often share repeated local patterns – for instance, chairs, tables, and nightstands all share a support surface, and chair or table legs can share a similar structure with lamp rods. Inspired by this, we regard a complete object as a set of substructures, where each substructure represents a local geometric region. We thus propose PatchComplete to learn such local priors and assemble them into a complete shape from a partial scan. An overview of our approach is shown in Figure 2.

3.2 Learning Local Patch-Based Shape Priors

Figure 3: Network architecture for local patch-based shape prior learning at a single resolution. We learn to build mappings from local regions in a partial object scan $x$ to local priors based on complete learnable shape priors $\mathcal{P}$. An input encoder encodes $x$ into local features (represented by patches $f^x_i$). Analogously, we adopt a prior encoder to process each prior in $\mathcal{P}$ to construct a local prior feature pool $\{f^{\mathcal{P}}_j\}$. In parallel, we chunk the priors in $\mathcal{P}$ into patch volumes $\{v_j\}$, which are fused according to the attention with input patches to compose the most informative patch priors: each incomplete local patch of $x$ queries the keys from complete prior patches, and their corresponding patch volumes are assembled for shape completion.

We first learn local shape priors from ground-truth train objects. We represent both the input partial scan $x$ and the ground-truth shape as 3D truncated signed distance field (TSDF) grids of size $32^3$. Figure 3 illustrates the process of learning local shape priors. We learn to build mappings from local regions in incomplete input observations to local priors based on complete shapes, in order to robustly reconstruct the complete shape output.

We first aim to learn patch-based priors, which we extract from a set of learnable global shape priors $\mathcal{P}$. We denote by $S_c$ the set of ground-truth train objects in the $c$-th category. The priors are initialized per train category based on mean-shift clustering over the shapes in $S_c$; thus $\mathcal{P}$ contains representative samples of each category, which are encoded in parallel to the input scan $x$.

Both encoders are analogously structured as 3D convolutional encoders which spatially downsample by the patch resolution $k$, resulting in a 3D encoding of spatial size $(32/k)^3$ with feature dimension $d$. The input encoding (from $x$) is uniformly chunked into patches $f^x_i$, $i = 1, \ldots, N_k$; similarly, the encoded priors (from $\mathcal{P}$) are chunked into patches $f^{\mathcal{P}}_j$.
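To make the chunking step concrete, the following sketch (our own illustration, not the authors' code; the batch size and feature dimension are arbitrary assumptions) flattens a 3D convolutional encoding into per-patch feature vectors, where each spatial cell of the encoding corresponds to one $k^3$ patch of the input grid.

```python
import torch

def chunk_encoding(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, d, D, D, D) encoder output with D = 32 // k.
    Returns (B, N_k, d), i.e. N_k = D**3 per-patch feature vectors."""
    return feat.flatten(2).transpose(1, 2)   # (B, d, D^3) -> (B, D^3, d)

# Example: k = 4 gives an encoding of spatial size 8^3, i.e. N_k = 512 patches.
feat = torch.randn(2, 128, 8, 8, 8)          # hypothetical feature dimension d = 128
patches = chunk_encoding(feat)
print(patches.shape)                          # torch.Size([2, 512, 128])
```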

We then use each local input patch $f^x_i$ to query for the most informative encodings of complete local regions $f^{\mathcal{P}}_j$, building this mapping with cross-attention Vaswani et al. (2017). For each input patch, we calculate its similarity with all local patches of the representative shapes across the training categories:

$$ w_{ij} = \mathrm{softmax}_j\left( \frac{f^x_i \cdot f^{\mathcal{P}}_j}{\sqrt{d}} \right), \qquad (1) $$

where $d$ is the dimension of the encoded patch vectors and the softmax runs over all prior patches from all $C$ train categories, with $C$ the number of categories. We then reconstruct complete shape patches by

$$ \hat{p}_i = \sum_j w_{ij} \, v_j, \qquad (2) $$

where $\{v_j\}$ is the set of TSDF chunks (of resolution $k^3$) from the prior shapes of all $C$ categories, and each chunk $v_j$ is paired with its encoded patch feature $f^{\mathcal{P}}_j$. We can then recompose the predicted patches $\hat{p}_i$ into the predicted full shape $\hat{y}$.
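As a concrete illustration of Eqs. (1) and (2), the sketch below (our own, not the authors' implementation; tensor shapes and the feature dimension are assumptions) computes scaled dot-product attention between input and prior patch features and blends the paired prior TSDF patch volumes into completed patches.

```python
import torch

def complete_patches(f_x, f_p, v_p):
    """f_x: (B, N_k, d)     input patch features (queries)
       f_p: (B, M, d)       prior patch features (keys), M = prior patches over all categories
       v_p: (B, M, k, k, k) prior TSDF patch volumes paired with f_p
       returns: (B, N_k, k, k, k) reconstructed complete patches."""
    d = f_x.shape[-1]
    attn = torch.softmax(f_x @ f_p.transpose(1, 2) / d ** 0.5, dim=-1)   # Eq. (1): (B, N_k, M)
    B, M = v_p.shape[:2]
    patches = attn @ v_p.reshape(B, M, -1)                               # Eq. (2): weighted sum of prior chunks
    return patches.reshape(B, f_x.shape[1], *v_p.shape[2:])

# Toy usage: 512 input patches of resolution 4^3 attending over 1024 prior patches.
f_x = torch.randn(1, 512, 128)
f_p = torch.randn(1, 1024, 128)
v_p = torch.randn(1, 1024, 4, 4, 4)
print(complete_patches(f_x, f_p, v_p).shape)   # torch.Size([1, 512, 4, 4, 4])
```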

Loss.

We use an $\ell_1$ reconstruction loss to train the learned local patch priors. Note that the priors $\mathcal{P}$ are learned along with the network weights, enabling learning of the most effective global shape priors for the shape completion task.

3.3 Multi-Resolution Patch Learning

Figure 4: Network architecture of the multi-resolution learning pipeline. We generate a complete shape from a partial input scan by fusing input local features and learned local priors in a multi-scale fashion. We first extract the input local features and the learned local priors using the prior-learning models of Section 3.2 under different resolutions ($k = 32, 8, 4$). Then, we use attention (Equation 3) to generate intermediate features $F^k$, which we then recursively fuse to decode a complete shape.

In Section 3.2, we learn to complete shapes at a single patch resolution $k$. Since local substructures may exist at various scales, we learn patch priors under varying resolutions, which we then use to decode a complete shape. We use three patch resolutions ($k = 4, 8, 32$), for which we learn patch priors. This results in three pairs of trained {input encoder, prior encoder} (see Figure 3). Given a partial input scan $x$, each pair outputs 1) a set of input patch features $f^{x,k}_i$ and 2) a set of prior patch features $f^{\mathcal{P},k}_j$, under its resolution $k = 4, 8, 32$. In this section, we decode a complete shape from these multi-resolution patch priors.

Since $f^{\mathcal{P},k}$ stores the patch priors, we use it as the (key, value) input to an attention module, where each input patch feature $f^{x,k}_i$ is used to query for the most informative prior features in $f^{\mathcal{P},k}$ under each resolution, from which we complete partial patches in feature space in a multi-scale fashion. We formulate this process as

$$ F^k_i = \mathrm{concat}\left( f^{x,k}_i, \ \sum_j \mathrm{softmax}_j\left( \frac{f^{x,k}_i \cdot f^{\mathcal{P},k}_j}{\sqrt{d}} \right) f^{\mathcal{P},k}_j \right), \qquad (3) $$

where we concatenate the input patch feature with the attention result to compensate for information loss in the observed regions. This outputs $F^k_i$ as the intermediate feature of the $i$-th patch, $i = 1, \ldots, N_k$, where $N_k$ denotes the number of patches under resolution $k$. We recompose all generated patch features into a volume $F^k$. Note that the dimension of each grid feature in $f^{x,k}$ and $f^{\mathcal{P},k}$ equals $d$ (see Eq. 1); $F^k$ thus has dimension $(32/k)^3 \times 2d$.

We then sequentially use 3D convolutional decoders to upsample and concatenate the features from low to high spatial resolution, fusing all shape features as

$$ G^{k_{i+1}} = \mathrm{concat}\left( \mathrm{Deconv}\left( G^{k_i} \right), \ F^{k_{i+1}} \right), \qquad G^{k_1} = F^{32}, \qquad (4) $$

where the resolutions are ordered $k_1 = 32$, $k_2 = 8$, $k_3 = 4$. In Eq. 4, $G^{k_i}$ has a lower spatial resolution than $F^{k_{i+1}}$, so we use deconvolution layers to upsample $G^{k_i}$ to match the resolution of $F^{k_{i+1}}$ before concatenating them. We recursively fuse the three levels, producing $G^{k_3}$ as the final fusion result at the spatial resolution of $F^4$. An extra upsampling followed by a convolution layer then upsamples $G^{k_3}$ into our shape prediction $\hat{y}$ of dimension $32^3$.

In training the feature fusion, we fix all parameters of the {input encoder, prior encoder} pairs for the three resolutions, since they are pre-trained as described in Section 3.2 and provide well-learned priors and attention maps under these strong constraints. The whole pipeline for this section is shown in Figure 4.
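The sketch below illustrates one possible reading of Eqs. (3) and (4): per-resolution attention features are concatenated with the input patch features, recomposed into volumes, and fused coarse-to-fine with transposed convolutions. Channel widths, kernel sizes, and layer counts are our assumptions and do not reflect the exact architecture (see the supplemental material for the authors' specifications).

```python
import torch
import torch.nn as nn

def attention_features(f_x, f_p):
    """Eq. (3): concat(input patch feature, attention over prior features).
    f_x: (B, N_k, d), f_p: (B, M, d)  ->  (B, N_k, 2d)."""
    d = f_x.shape[-1]
    attn = torch.softmax(f_x @ f_p.transpose(1, 2) / d ** 0.5, dim=-1)
    return torch.cat([f_x, attn @ f_p], dim=-1)

def to_volume(feats, grid):
    """Recompose (B, grid^3, c) patch features into a (B, c, grid, grid, grid) volume."""
    B, _, c = feats.shape
    return feats.transpose(1, 2).reshape(B, c, grid, grid, grid)

class MultiResFusion(nn.Module):
    """Eq. (4): recursively deconvolve the coarser fused volume and concatenate
    it with the next-finer attention volume (k = 32 -> 8 -> 4), then decode to a TSDF."""
    def __init__(self, c32, c8, c4):
        super().__init__()
        self.up32 = nn.ConvTranspose3d(c32, c8, kernel_size=4, stride=4)        # 1^3 -> 4^3
        self.up8 = nn.ConvTranspose3d(c8 + c8, c4, kernel_size=2, stride=2)     # 4^3 -> 8^3
        self.head = nn.Sequential(                                              # 8^3 -> 32^3 TSDF
            nn.ConvTranspose3d(c4 + c4, 32, kernel_size=4, stride=4),
            nn.Conv3d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, F32, F8, F4):
        g = torch.cat([self.up32(F32), F8], dim=1)   # fuse the k=32 level into the k=8 level
        g = torch.cat([self.up8(g), F4], dim=1)      # fuse into the k=4 level
        return self.head(g)                          # (B, 1, 32, 32, 32) predicted TSDF

# Toy usage: d = 128 feature dim; spatial grids 1^3 (k=32), 4^3 (k=8), 8^3 (k=4).
d = 128
F32 = to_volume(attention_features(torch.randn(1, 1, d), torch.randn(1, 16, d)), 1)
F8 = to_volume(attention_features(torch.randn(1, 64, d), torch.randn(1, 256, d)), 4)
F4 = to_volume(attention_features(torch.randn(1, 512, d), torch.randn(1, 1024, d)), 8)
model = MultiResFusion(c32=2 * d, c8=2 * d, c4=2 * d)
print(model(F32, F8, F4).shape)   # torch.Size([1, 1, 32, 32, 32])
```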

We use an $\ell_1$ loss to supervise the TSDF value regression. We weight the loss to penalize false predictions based on the predicted signs, as Eq. 5 shows, where the sign indicates whether a voxel is occupied or not; the same weighting is also used in the loss function of Section 3.2:

$$ \mathcal{L} = w_{fp} \sum_{v \in \mathcal{V}_{fp}} \left| \hat{y}_v - y_v \right| + w_{fn} \sum_{v \in \mathcal{V}_{fn}} \left| \hat{y}_v - y_v \right| + \sum_{v \in \mathcal{V}_{tp}} \left| \hat{y}_v - y_v \right|. \qquad (5) $$

In Eq. 5, $\mathcal{V}_{fp}$ denotes the false positive TSDF predictions, where the ground truth has negative sign and the prediction has positive sign, which in general indicates missing predictions; $\mathcal{V}_{fn}$ denotes the false negative TSDF predictions, where the ground truth has positive sign and the prediction has negative sign, indicating extra predictions; and $\mathcal{V}_{tp}$ denotes the predicted TSDF values with the same sign as the ground truth. During training, we set the weight for false positives to $w_{fp} = 5$ and the weight for false negatives to $w_{fn} = 3$.
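A minimal sketch of this sign-weighted $\ell_1$ TSDF loss, assuming the sign convention described above (negative = occupied) and a mean reduction:

```python
import torch

def weighted_l1_tsdf_loss(pred, gt, w_fp=5.0, w_fn=3.0):
    """pred, gt: (B, 1, 32, 32, 32) TSDF grids."""
    false_pos = (gt < 0) & (pred >= 0)   # GT occupied, predicted free: missing geometry
    false_neg = (gt >= 0) & (pred < 0)   # GT free, predicted occupied: extra geometry
    weights = torch.ones_like(gt)
    weights[false_pos] = w_fp
    weights[false_neg] = w_fn
    return (weights * (pred - gt).abs()).mean()
```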

3.4 Implementation Details

We train our approach on a single NVIDIA A6000 GPU, using an Adam optimizer with batch size 32 for the synthetic dataset and batch size 16 for the real-world dataset, and an initial learning rate of 0.001. We train for 80 epochs until convergence, halving the learning rate after epoch 50. We use the same settings for prior learning and multi-resolution decoding, which train for 4 and 15.5 hours, respectively. For additional network architecture details, we refer to the supplemental material.

Note that for training on real scan data, we first pre-train on synthetic data and then fine-tune only the input encoder.
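For reference, a training-configuration sketch matching the reported hyperparameters (Adam, initial learning rate 0.001 halved after epoch 50, 80 epochs); the data loader and the multi-resolution encoding helper are placeholders, and the model and loss reuse the sketches above rather than the authors' code:

```python
import torch

model = MultiResFusion(c32=256, c8=256, c4=256)       # hypothetical channel widths
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50], gamma=0.5)

for epoch in range(80):
    for partial_tsdf, gt_tsdf in train_loader:         # placeholder DataLoader (batch size 32)
        optimizer.zero_grad()
        F32, F8, F4 = encode_multires(partial_tsdf)    # placeholder: frozen encoders + Eq. (3)
        loss = weighted_l1_tsdf_loss(model(F32, F8, F4), gt_tsdf)
        loss.backward()
        optimizer.step()
    scheduler.step()
```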

4 Experiments and Analysis

4.1 Experimental Setup

Datasets.

We train and evaluate our approach on synthetic shape data from ShapeNet Chang et al. (2015) as well as on challenging real-world scan data from ScanNet Dai et al. (2017a). For ShapeNet data, we virtually scan the objects to create partial input scans, following Nie et al. (2020a); Dai et al. (2017c). We use 18 categories during training, and test on 8 novel categories, resulting in 3,202/1,325 train/test models with 4 partial scans for each model.

For real data, we use real-scanned objects from ScanNet extracted by their bounding boxes, with corresponding complete target data given by Scan2CAD Avetisyan et al. (2018). We use 8 categories for training, comprising 7,537 train samples, and test on 6 novel categories of 1,191 test samples.

For all experiments, objects are represented as signed distance grids with truncation of 2.5 voxel units for ShapeNet and 3 voxel units for ScanNet. The objects in ShapeNet are normalized into the unit cube, while we keep the scaling for ScanNet objects, and save their voxel sizes separately to keep the real size information. Additionally, to train and evaluate on real data, all methods are first pre-trained on ShapeNet and then fine-tuned on ScanNet.

Baselines.

We evaluate our approach against various state-of-the-art shape completion methods: 3D-EPN Dai et al. (2017c) and IF-Nets Chibane et al. (2020), which learn shape completion on dense voxel grids and with implicit neural field representations, respectively, without any focus on unseen class categories. We further compare to the state-of-the-art few-shot shape reconstruction approach of Wallace and Hariharan (2019) (referred to as Few-Shot), which leverages global shape priors and which we apply in our zero-shot unseen-category scenario. Finally, AutoSDF Mittal et al. (2022) uses a VQ-VAE module with a transformer-based autoregressive model over latent patch priors to produce TSDF shape reconstructions.

Evaluation Metrics.

To evaluate the quality of reconstructed shape geometry, we use Chamfer Distance (CD) and Intersection over Union (IoU) between predicted and ground-truth shapes. To evaluate methods that output occupancy grids, we use the occupancy thresholds used by the respective methods to obtain voxel predictions, i.e., 0.4 for Wallace and Hariharan (2019) and 0.5 for Chibane et al. (2020). To evaluate methods that output signed distance fields, we extract the iso-surface at level zero with marching cubes Lorensen and Cline (1987). 10K points are sampled on the surfaces for CD calculation. Both Chamfer Distance and IoU are evaluated on objects in the canonical coordinate system, and we report scaled Chamfer Distance values.
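For reproducibility, the following sketch shows one common way to compute these metrics (our own assumptions about the exact CD formulation and occupancy convention; the paper's evaluation code may differ):

```python
from scipy.spatial import cKDTree
from skimage import measure
import trimesh

def tsdf_to_points(tsdf, n_points=10_000):
    """Extract the zero iso-surface of a TSDF grid and sample points on it."""
    verts, faces, _, _ = measure.marching_cubes(tsdf, level=0.0)
    return trimesh.Trimesh(verts, faces).sample(n_points)

def chamfer_distance(p1, p2):
    """Symmetric Chamfer Distance: sum of mean nearest-neighbor distances."""
    d12, _ = cKDTree(p2).query(p1)
    d21, _ = cKDTree(p1).query(p2)
    return d12.mean() + d21.mean()

def voxel_iou(pred_tsdf, gt_tsdf):
    """Voxel IoU, treating negative TSDF values as occupied."""
    pred_occ, gt_occ = pred_tsdf < 0, gt_tsdf < 0
    return (pred_occ & gt_occ).sum() / max((pred_occ | gt_occ).sum(), 1)
```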

First five numeric columns: Chamfer Distance (lower is better); last five: IoU (higher is better).
Method order within each group: 3D-EPN Dai et al. (2017c), Few-Shot Wallace and Hariharan (2019), IF-Nets Chibane et al. (2020), AutoSDF Mittal et al. (2022), Ours.
Bag 5.01 8.00 4.77 5.81 3.94 0.738 0.561 0.698 0.563 0.776
Lamp 8.07 15.10 5.70 6.57 4.68 0.472 0.254 0.508 0.391 0.564
Bathtub 4.21 7.05 4.72 5.17 3.78 0.579 0.457 0.550 0.410 0.663
Bed 5.84 10.03 5.34 6.01 4.49 0.584 0.396 0.607 0.446 0.668
Basket 7.90 8.72 4.44 6.70 5.15 0.540 0.406 0.502 0.398 0.610
Printer 5.15 9.26 5.83 7.52 4.63 0.736 0.567 0.705 0.499 0.776
Laptop 3.90 10.35 6.47 4.81 3.77 0.620 0.313 0.583 0.511 0.638
Bench 4.54 8.11 5.03 4.31 3.70 0.483 0.272 0.497 0.395 0.539
Inst-Avg 5.48 9.75 5.37 5.76 4.23 0.582 0.386 0.574 0.446 0.644
Cat-Avg 5.58 9.58 5.29 5.86 4.27 0.594 0.403 0.581 0.452 0.654
Table 1: Quantitative comparison for shape completion on synthetic ShapeNet Chang et al. (2015) data.

4.2 Evaluation on Synthetic Data

In Table 1, we evaluate our approach in comparison with prior arts on unseen class categories of synthetic ShapeNet Chang et al. (2015) data. Our approach on learning attention-based correlation to learned local shape priors results in notably improved reconstruction performance, with coherent global and local structures, as shown in Figure 5. In Table 1, our work outperforms other baselines both instance-wise and category-wise. One of the key factors is that our method learns multi-scale patch information from seen categories to complete unseen categories with enough flexibility, while most of the other baselines are designed for 3D shape completion on known categories, which hardly leverage shape priors across categories.

Figure 5: Qualitative comparison for shape completion on synthetic ShapeNet Chang et al. (2015) dataset.

4.3 Evaluation on Real Scan Data

Table 2 evaluates our approach in comparison with prior arts on real scanned objects from unseen categories in ScanNet Dai et al. (2017a). Here, input scans are not only partial but often contain noise and clutter; our multi-resolution learned priors enable more robust shape completion in this challenging scenario. Results in Figure 6 further demonstrate that our approach presents more coherent shape completion than prior methods by using cross-attention with learnable priors, which better preserves the global structures in coarse and cluttered environments.

First five numeric columns: Chamfer Distance (lower is better); last five: IoU (higher is better).
Method order within each group: 3D-EPN Dai et al. (2017c), Few-Shot Wallace and Hariharan (2019), IF-Nets Chibane et al. (2020), AutoSDF Mittal et al. (2022), Ours.
Bag 8.83 9.10 8.96 9.30 8.23 0.537 0.449 0.442 0.487 0.583
Lamp 14.27 11.88 10.16 11.17 9.42 0.207 0.196 0.249 0.244 0.284
Bathtub 7.56 7.77 7.19 7.84 6.77 0.410 0.382 0.395 0.366 0.480
Bed 7.76 9.07 8.24 7.91 7.24 0.478 0.349 0.449 0.380 0.484
Basket 7.74 8.02 6.74 7.54 6.60 0.365 0.343 0.427 0.361 0.455
Printer 8.36 8.30 8.28 9.66 6.84 0.630 0.622 0.607 0.499 0.705
Inst-Avg 8.60 8.83 8.12 8.56 7.38 0.441 0.387 0.426 0.386 0.498
Cat-Avg 9.09 9.02 8.26 8.90 7.52 0.440 0.386 0.426 0.389 0.495
Table 2: Quantitative comparison with state of the art on real-world ScanNet shape completion.
Figure 6: Shape completion on real-world ScanNet Dai et al. (2017a) object scans. Our method to learn mappings to multi-resolution learnable patch priors enables more coherent shape completion on novel categories.

4.4 Ablation Analysis

ShapeNet Chang et al. (2015) (first four numeric columns) | ScanNet Dai et al. (2017a) (last four numeric columns)
Inst-CD Cat-CD Inst-IoU Cat-IoU Inst-CD Cat-CD Inst-IoU Cat-IoU
Ours ($32^3$ priors only) 11.94 11.61 0.35 0.37 10.43 11.23 0.41 0.40
Ours ($8^3$ priors only) 4.86 4.89 0.61 0.62 7.64 7.81 0.48 0.49
Ours ($4^3$ priors only) 4.44 4.50 0.64 0.64 7.34 7.58 0.49 0.49
Ours 4.23 4.27 0.64 0.65 7.38 7.52 0.50 0.50
Table 3: Ablation study on different patch resolutions. A multi-resolution approach gains benefits from both global and local reasoning.

Does multi-resolution patch learning help shape completion for novel categories?

In Table 3, we evaluate shape completion with each individual resolution in comparison with our multi-resolution approach. Learning only global shape priors (i.e., $32^3$ resolution) tends to overfit to seen train categories, while the local patch resolutions provide more generalizable priors. Combining all resolutions results in complementary feature learning and the most effective shape completion.

Does cross-attention to learn local priors help?

We evaluate our approach to learn both local priors and their correlation to input observations with cross-attention in Table 4, which shows that this enables more effective shape completion on unseen categories.

ShapeNet Chang et al. (2015) (first four numeric columns) | ScanNet Dai et al. (2017a) (last four numeric columns)
Inst-CD Cat-CD Inst-IoU Cat-IoU Inst-CD Cat-CD Inst-IoU Cat-IoU
Ours (no attention) 4.90 4.98 0.61 0.62 7.80 8.09 0.49 0.48
Ours 4.69 4.74 0.61 0.63 7.58 7.84 0.48 0.49
Table 4: Ablation study on attention used to learn input and patch prior correlations.
Inst-CD Cat-CD Inst-IoU Cat-IoU
Scratch 7.61 7.73 0.50 0.50
Ours 7.38 7.52 0.50 0.50
Table 5: Effect of synthetic pre-training on real-world ScanNet object completion vs. training from scratch.

What is the effect of synthetic pre-training for real scan completion?

Table 5 shows the effect of synthetic pre-training for shape completion on real scanned objects. This encourages learning more robust priors to output cleaner local structures as given in the synthetic data, resulting in improved performance on real scanned objects.

Inst-CD Cat-CD Inst-IoU Cat-IoU
Ours (fixed priors) 4.31 4.34 0.64 0.65
Ours 4.23 4.27 0.64 0.65
Table 6: Ablation on learnable priors in comparison with fixed priors on ShapeNet.

Does learning the priors help completion?

In Table 6, we evaluate our learnable priors in comparison with using fixed priors (by mean-shift clustering of train objects) for shape completion on ShapeNet Chang et al. (2015). Learned priors receive gradient information to adapt to best reconstruct the shapes, enabling improved performance over a fixed set of priors.

4.5 Limitations

While PatchComplete has presented a promising step towards learning more generalizable shape priors, various limitations remain. For instance, the output shape completion is limited by the dense voxel grid resolution in representing fine-scale geometric details. Additionally, detected object bounding boxes are required as input for real scan data, with shape completion predicted independently per object; a formulation that considers other objects in the scene, or an end-to-end framework, could learn more effectively from global scene context.

5 Conclusion

We have proposed PatchComplete to learn effective local shape priors for shape completion, enabling robust reconstruction of novel class categories at test time. Our approach learns shared local substructures across a variety of shapes, maps them to local incomplete observations by cross-attention, and fuses them across resolutions to produce coherent geometry at both global and local scales. This enables robust shape completion even for unseen categories with different global structures, on synthetic as well as challenging real-world scanned objects with noise and clutter. We believe that such robust reconstruction of real scanned objects takes an important step towards understanding 3D shapes, and we hope it inspires future work on understanding real-world shapes and scene structures.

Acknowledgements

This project is funded by the Bavarian State Ministry of Science and the Arts and coordinated by the Bavarian Research Institute for Digital Transformation (bidt).

References

  • A. Avetisyan, M. Dahnert, A. Dai, M. Savva, A. X. Chang, and M. Nießner (2018) Scan2CAD: learning cad model alignment in rgb-d scans. External Links: 1811.11187 Cited by: Appendix A, §4.1.
  • J. Bechtold, M. Tatarchenko, V. Fischer, and T. Brox (2021) Fostering generalization in single-view 3d reconstruction by learning a hierarchy of local and global shape priors. External Links: 2104.00476 Cited by: §2.2.
  • A. X. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3D: learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV). Cited by: §1.
  • A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu (2015) ShapeNet: an information-rich 3d model repository. External Links: 1512.03012 Cited by: Appendix A, Table 7, Figure 9, Figure 5, §4.1, §4.2, §4.4, Table 1, Table 3, Table 4.
  • Z. Chen, Y. Zhang, K. Genova, S. Fanello, S. Bouaziz, C. Haene, R. Du, C. Keskin, T. Funkhouser, and D. Tang (2021) Multiresolution deep implicit functions for 3d shape representation. External Links: 2109.05591 Cited by: §2.2.
  • J. Chibane, T. Alldieck, and G. Pons-Moll (2020) Implicit functions in feature space for 3d shape reconstruction and completion. External Links: 2003.01456 Cited by: Appendix B, Table 7, Table 8, Table 9, §2.1, §4.1, §4.1, Table 1, Table 2.
  • S. Choi, Q. Zhou, and V. Koltun (2015) Robust reconstruction of indoor scenes. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5556–5565. Cited by: §1.
  • C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese (2016) 3d-r2n2: a unified approach for single and multi-view 3d object reconstruction. In European conference on computer vision, pp. 628–644. Cited by: §1.
  • B. Curless and M. Levoy (1996) A volumetric method for building complex models from range images. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pp. 303–312. Cited by: Appendix A.
  • A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017a) ScanNet: richly-annotated 3d reconstructions of indoor scenes. External Links: 1702.04405 Cited by: Appendix A, Table 8, Table 9, Appendix D, Figure 10, §1, Figure 6, §4.1, §4.3, Table 3, Table 4.
  • A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt (2017b) BundleFusion: real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (TOG) 36 (3), pp. 24. Cited by: §1.
  • A. Dai and M. Nießner (2019) Scan2mesh: from unstructured range scans to 3d meshes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5574–5583. Cited by: §1, §2.1.
  • A. Dai, C. R. Qi, and M. Nießner (2017c) Shape completion using 3d-encoder-predictor cnns and shape synthesis. External Links: 1612.00101 Cited by: Appendix B, Table 7, Table 8, Table 9, §1, §2.1, §4.1, §4.1, Table 1, Table 2.
  • H. Fan, H. Su, and L. J. Guibas (2017) A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 605–613. Cited by: §1.
  • S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, et al. (2011) KinectFusion: real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pp. 559–568. Cited by: §1.
  • X. Li, S. Liu, K. Kim, S. D. Mello, V. Jampani, M. Yang, and J. Kautz (2020) Self-supervised single-view 3d reconstruction via semantic consistency. In European Conference on Computer Vision, pp. 677–693. Cited by: §2.1.
  • W. E. Lorensen and H. E. Cline (1987) Marching cubes: a high resolution 3d surface construction algorithm. ACM siggraph computer graphics 21 (4), pp. 163–169. Cited by: §4.1.
  • L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3d reconstruction in function space. External Links: 1812.03828 Cited by: §1, §2.1.
  • M. Michalkiewicz, S. Tsogkas, S. Parisot, M. Baktashmotlagh, A. Eriksson, and E. Belilovsky (2021) Learning compositional shape priors for few-shot 3d reconstruction. External Links: 2106.06440 Cited by: §2.2.
  • P. Mittal, Y. Cheng, M. Singh, and S. Tulsiani (2022) AutoSDF: shape priors for 3d completion, reconstruction and generation. In CVPR, Cited by: Appendix B, Table 7, Table 8, Table 9, §2.2, §4.1, Table 1, Table 2.
  • M. F. Naeem, E. P. Örnek, Y. Xian, L. V. Gool, and F. Tombari (2021) 3D compositional zero-shot learning with decompositional consensus. External Links: 2111.14673 Cited by: §2.2.
  • R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon (2011) KinectFusion: real-time dense surface mapping and tracking. In Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on, pp. 127–136. Cited by: §1.
  • Y. Nie, Y. Lin, X. Han, S. Guo, J. Chang, S. Cui, J. Zhang, et al. (2020a) Skeleton-bridged point completion: from global inference to local adjustment. Advances in Neural Information Processing Systems 33, pp. 16119–16130. Cited by: §4.1.
  • Y. Nie, Y. Lin, X. Han, S. Guo, J. Chang, S. Cui, and J. Zhang (2020b) Skeleton-bridged point completion: from global inference to local adjustment. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 16119–16130. Cited by: Appendix A.
  • M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger (2013) Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG). Cited by: §1.
  • J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019) DeepSDF: learning continuous signed distance functions for shape representation. External Links: 1901.05103 Cited by: §1, §2.1.
  • S. Peng, M. Niemeyer, L. Mescheder, M. Pollefeys, and A. Geiger (2020) Convolutional occupancy networks. External Links: 2003.04618 Cited by: §2.1.
  • D. Stutz and A. Geiger (2020) Learning 3d shape completion under weak supervision. International Journal of Computer Vision 128 (5), pp. 1162–1181. Cited by: §2.1.
  • J. Tang, J. Lei, D. Xu, F. Ma, K. Jia, and L. Zhang (2021) SA-convonet: sign-agnostic optimization of convolutional occupancy networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6504–6513. Cited by: §2.1.
  • M. Tatarchenko, A. Dosovitskiy, and T. Brox (2017) Octree generating networks: efficient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE international conference on computer vision, pp. 2088–2096. Cited by: §1.
  • A. Thai, S. Stojanov, V. Upadhya, and J. M. Rehg (2021) 3D reconstruction of novel object shapes from single images. External Links: 2006.07752 Cited by: §2.2.
  • E. Tretschk, A. Tewari, V. Golyanik, M. Zollhöfer, C. Stoll, and C. Theobalt (2021) PatchNets: patch-based generalizable deep implicit 3d shape representations. External Links: 2008.01639 Cited by: §2.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §3.2.
  • B. Wallace and B. Hariharan (2019) Few-shot generalization for single-image 3d reconstruction via priors. External Links: 1909.01205 Cited by: Appendix B, Table 7, Table 8, Table 9, §2.2, §4.1, §4.1, Table 1, Table 2.
  • N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. Jiang (2018) Pixel2mesh: generating 3d mesh models from single rgb images. In Proceedings of the European conference on computer vision (ECCV), pp. 52–67. Cited by: §1.
  • T. Whelan, S. Leutenegger, R. F. Salas-Moreno, B. Glocker, and A. J. Davison (2015) ElasticFusion: dense slam without a pose graph. Proc. Robotics: Science and Systems, Rome, Italy. Cited by: §1.
  • Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao (2015) 3D shapenets: a deep representation for volumetric shapes. External Links: 1406.5670 Cited by: §2.1.
  • X. Yan, L. Lin, N. J. Mitra, D. Lischinski, D. Cohen-Or, and H. Huang (2022) ShapeFormer: transformer-based shape completion via sparse representation. arXiv preprint arXiv:2201.10326. Cited by: §2.2.
  • G. Yang, X. Huang, Z. Hao, M. Liu, S. Belongie, and B. Hariharan (2019) Pointflow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4541–4550. Cited by: §1.
  • X. Zhang, Z. Zhang, C. Zhang, J. B. Tenenbaum, W. T. Freeman, and J. Wu (2018) Learning to reconstruct shapes from unseen classes. External Links: 1812.11166 Cited by: §2.2.

Appendix A Data Generation

ShapeNet Chang et al. (2015)

We use ShapeNet (license: https://shapenet.org/; we received permission after registration; the data contains no personally identifiable information or offensive content) to test our performance on synthetic data. In order to generate watertight meshes as ground truth, we first normalize the ShapeNet CAD models and render depth maps from 20 different viewpoints for each model. We then use volumetric fusion Curless and Levoy (1996) to generate truncated signed distance fields (TSDFs) with a truncation value of 2.5 voxel units. Finally, we choose 4 single-view TSDFs as inputs, which mimic the partial scans in real data (e.g., ScanNet). Our data generation follows Nie et al. (2020b) (https://github.com/yinyunie/depth_renderer).
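For illustration, a simplified volumetric-fusion sketch in the spirit of Curless and Levoy (1996), assuming already-rendered depth maps and simplified camera conventions; it is not the authors' data-generation code:

```python
import numpy as np

def fuse_tsdf(depth_maps, intrinsics, extrinsics, grid=32, trunc=2.5):
    """depth_maps: list of (H, W) arrays; intrinsics: (3, 3); extrinsics: list of (4, 4)
    world-to-camera matrices. Returns a (grid, grid, grid) TSDF in voxel units."""
    # Voxel centers of a unit cube centered at the origin (object assumed normalized).
    coords = (np.indices((grid,) * 3).reshape(3, -1).T + 0.5) / grid - 0.5
    tsdf = np.zeros(grid ** 3)
    weight = np.zeros(grid ** 3)
    for depth, T in zip(depth_maps, extrinsics):
        cam = T[:3, :3] @ coords.T + T[:3, 3:4]                    # voxels in camera frame
        uv = intrinsics @ cam
        u = np.round(uv[0] / np.maximum(uv[2], 1e-8)).astype(int)
        v = np.round(uv[1] / np.maximum(uv[2], 1e-8)).astype(int)
        valid = (cam[2] > 0) & (u >= 0) & (u < depth.shape[1]) & (v >= 0) & (v < depth.shape[0])
        d_obs = np.where(valid, depth[np.clip(v, 0, depth.shape[0] - 1),
                                      np.clip(u, 0, depth.shape[1] - 1)], 0.0)
        sdf = (d_obs - cam[2]) * grid                              # signed distance in voxel units
        keep = valid & (d_obs > 0) & (sdf > -trunc)                # skip voxels far behind the surface
        tsdf[keep] += np.clip(sdf[keep], -trunc, trunc)
        weight[keep] += 1.0
    tsdf = np.where(weight > 0, tsdf / np.maximum(weight, 1), trunc)
    return tsdf.reshape(grid, grid, grid)
```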

We split the training and testing object categories on ShapeNet as follows. The 18 training categories are table, chair, sofa, cabinet, clock, bookshelf, piano, microwave, stove, file cabinet, trash bin, bowl, display, keyboard, dishwasher, washing machine, pots, faucet, and guitar; and the 8 novel testing categories are bathtub, lamp, bed, bag, printer, laptop, bench, and basket.

ScanNet Dai et al. (2017a)

We use ScanNet (license: https://github.com/ScanNet/ScanNet; we filled out an agreement; the data contains no personally identifiable information or offensive content) to test our method on real-world data. The inputs are directly extracted from ScanNet scenes based on the bounding box annotations from Scan2CAD Avetisyan et al. (2018) (license: https://github.com/skanti/Scan2CAD; we filled out an agreement). We keep their real scale, convert them to voxel grids with a truncation value of 3 voxel units, and save their voxel sizes separately. These inputs can contain walls, floors, or other cluttered backgrounds, and are transformed to canonical space to be aligned with the ShapeNet model coordinate system. The ground truths are the corresponding complete and watertight ShapeNet meshes based on Scan2CAD annotations, generated with a method similar to the above.

We split the training and testing object categories on ScanNet as follows. The 8 training categories are chair, table, sofa, trash bin, cabinet, bookshelf, file cabinet, and monitor; and the 6 novel testing categories are bathtub, lamp, bed, bag, basket, and printer, and each category has more than 50 samples for testing.

Appendix B Baseline Comparison

We use the authors’ original implementations and hyperparameters in all the baselines for fair comparisons.

3D-EPN Dai et al. (2017c)

3D-EPN is a two-stage network, which first completes partial 3D scans and then reconstructs the completed shapes at a higher resolution by retrieving priors from a category-wise shape pool. In our case, priors for novel categories are not accessible; thus, we only compare against its 3D Encoder-Predictor Network (the 3D completion model) on our dataset.

Wallace and Hariharan (2019) (Few-Shot)

This method uses a few-shot learning strategy for single-view completion with an averaged shape prior for each category. For a fair comparison with other works, we adapt it to a zero-shot mechanism here. We pre-compute the averaged shape priors for each training category; during training, we use two voxel encoder modules in parallel for the occupied voxel grid inputs and the averaged shape prior of the input category; at test time, since we cannot provide shape priors for novel categories, we average the shape priors of all training categories and use this averaged prior as input to the prior encoder module, along with the test samples for shape completion.
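A minimal sketch of this zero-shot adaptation (variable names and the data layout are our own assumptions):

```python
import torch

def test_time_prior(category_priors: dict) -> torch.Tensor:
    """category_priors: {category_name: (32, 32, 32) averaged occupancy grid per train category}.
    At test time, the per-category priors are averaged into one category-agnostic prior."""
    return torch.stack(list(category_priors.values())).mean(dim=0)
```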

IF-Net Chibane et al. (2020)

IF-Net can predict implicit shape representations conditioned on different input modalities (e.g., voxels, point clouds). We use occupied surface voxel grids as inputs, and use point clouds sampled from watertight ShapeNet meshes as the ground truth for training and testing. We also normalize the ground-truth meshes from the ScanNet dataset to sample points.

AutoSDF Mittal et al. (2022)

AutoSDF learns latent patch priors using a VQ-VAE along with a transformer-based autoregressive model for 3D shape completion, and manually picks the unknown patches during testing. Following their settings, we apply their method using ground-truth SDFs as the training data; during testing on the ShapeNet data, we mark patches in which more than 400 voxels have negative signs as the unknown patches (unseen parts) that need to be generated.
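A sketch of this unknown-patch selection heuristic (our own re-implementation; the $8^3$ patch size is an assumption not stated above):

```python
import torch

def unknown_patch_mask(sdf: torch.Tensor, patch: int = 8, threshold: int = 400) -> torch.Tensor:
    """sdf: (D, D, D) grid with D divisible by `patch`; returns a boolean mask over patches,
    marking a patch 'unknown' if more than `threshold` of its voxels have negative sign."""
    D = sdf.shape[0]
    patches = sdf.reshape(D // patch, patch, D // patch, patch, D // patch, patch)
    neg_counts = (patches < 0).sum(dim=(1, 3, 5))
    return neg_counts > threshold
```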

Note that since AutoSDF focuses on multi-modal shape completion and produces multiple output possibilities, we report the performance of only the best of its nine predictions (highest IoU with respect to ground truth), given an oracle to indicate the best one.

Furthermore, as there are no fully unknown patches for ScanNet scans because of the cluttered environments, we use the pipeline of their single-view reconstruction task. We first replace their ResNet in resnet2vq_model with three 3D encoders (the same as the 3D-EPN encoders) to extract encoding features of the desired dimensions; then we train this modified model along with the pre-trained pvq_vae_model on partial ScanNet inputs; finally, we test our partial ScanNet inputs with all the pre-trained models: resnet2vq_model, pvq_vae_model, and rand_tf_model.

Appendix C Category-wise Evaluations with Error Bars

Table 7 and Table 8 show the category-wise error bars on ShapeNet and ScanNet, respectively; each method is run multiple times to obtain the error bars.

First five numeric columns: Chamfer Distance (lower is better); last five: IoU (higher is better).
Method order within each group: 3D-EPN Dai et al. (2017c), Few-Shot Wallace and Hariharan (2019), IF-Nets Chibane et al. (2020), AutoSDF Mittal et al. (2022), Ours.
Bag 5.01 8.00 4.77 5.81 3.94 0.738 0.561 0.698 0.563 0.776
Lamp 8.07 15.10 5.70 6.57 4.68 0.472 0.254 0.508 0.391 0.564
Bathtub 4.21 7.05 4.72 5.17 3.78 0.579 0.457 0.550 0.410 0.663
Bed 5.84 10.03 5.34 6.01 4.49 0.584 0.396 0.607 0.446 0.668
Basket 7.90 8.72 4.44 6.70 5.15 0.540 0.406 0.502 0.398 0.610
Printer 5.15 9.26 5.83 7.52 4.63 0.736 0.567 0.705 0.499 0.776
Laptop 3.90 10.35 6.47 4.81 3.77 0.620 0.313 0.583 0.511 0.638
Bench 4.54 8.11 5.03 4.31 3.70 0.483 0.272 0.497 0.395 0.539
Inst-Avg 5.48 9.75 5.37 5.76 4.23 0.582 0.386 0.574 0.446 0.644
Cat-Avg 5.58 9.58 5.29 5.86 4.27 0.594 0.403 0.581 0.452 0.654
Table 7: Quantitative comparisons with state of the art on ShapeNet Chang et al. (2015).
First five numeric columns: Chamfer Distance (lower is better); last five: IoU (higher is better).
Method order within each group: 3D-EPN Dai et al. (2017c), Few-Shot Wallace and Hariharan (2019), IF-Nets Chibane et al. (2020), AutoSDF Mittal et al. (2022), Ours.
Bag 8.83 9.10 8.96 9.30 8.23 0.537 0.449 0.442 0.487 0.583
Lamp 14.27 11.88 10.16 11.17 9.42 0.207 0.196 0.249 0.244 0.284
Bathtub 7.56 7.77 7.19 7.84 6.77 0.410 0.382 0.395 0.366 0.480
Bed 7.76 9.07 8.24 7.91 7.24 0.478 0.349 0.449 0.380 0.484
Basket 7.74 8.02 6.74 7.54 6.60 0.365 0.343 0.427 0.361 0.455
Printer 8.36 8.30 8.28 9.66 6.84 0.630 0.622 0.607 0.499 0.705
Inst-Avg 8.60 8.83 8.12 8.56 7.38 0.441 0.387 0.426 0.386 0.498
Cat-Avg 9.09 9.02 8.26 8.90 7.52 0.440 0.386 0.426 0.389 0.495
Table 8: Quantitative comparisons with state of the art on ScanNet Dai et al. (2017a).

Appendix D Evaluation on Seen Categories

Table 9 shows comparisons on seen train categories with state of the art on real-world data from ScanNet Dai et al. (2017a). We evaluate 1,060 samples from 7 seen categories: chair, table, sofa, trash bin, cabinet, bookshelf, and monitor; categories are selected as those with more than 50 test samples.

Table 9 shows that our performance on seen categories is on par with state of the art, particularly for category averages, as our learned multi-resolution priors maintain robustness across categories. Note that, similar to the previous evaluations, AutoSDF results are reported as the best among its nine predictions (highest IoU with respect to ground truth), given an oracle to indicate the best choice. Our method thus achieves performance on par with state of the art on seen categories, while notably improving shape completion for unseen categories.

First five numeric columns: Chamfer Distance (lower is better); last five: IoU (higher is better).
Method order within each group: 3D-EPN Dai et al. (2017c), Few-Shot Wallace and Hariharan (2019), IF-Nets Chibane et al. (2020), AutoSDF Mittal et al. (2022), Ours.
Trash Bin 5.03 5.65 5.23 4.48 4.44 0.61 0.70 0.62 0.66 0.68
Chair 9.99 6.88 7.93 6.00 7.14 0.40 0.46 0.43 0.49 0.45
Bookshelf 4.87 4.33 5.17 4.12 3.80 0.53 0.65 0.58 0.61 0.61
Table 8.74 7.13 10.15 6.72 6.60 0.47 0.50 0.46 0.49 0.54
Cabinet 4.60 4.36 5.64 4.53 4.17 0.76 0.80 0.74 0.78 0.79
Sofa 4.94 4.28 7.87 4.58 4.53 0.69 0.75 0.67 0.72 0.73
Monitor 5.75 4.98 6.39 5.92 4.74 0.52 0.59 0.53 0.49 0.56
Inst Avg 7.94 6.18 7.65 5.68 6.02 0.50 0.56 0.51 0.55 0.55
Cat Avg 6.27 5.37 6.91 5.20 5.06 0.57 0.63 0.58 0.61 0.62
Table 9: Quantitative comparison with state of the art on real-world ScanNet Dai et al. (2017a) shape completion for seen categories. We bold the best results and underline the second best results in the table.

Appendix E Model Architecture Details

Figure 7 details our model architecture. Figure 7 (a), (b), and (c) present the submodules for learning patch priors at the three resolutions ($k = 4, 8, 32$). The network in Figure 7 (d) shows our multi-resolution patch learning stage. Inputs are partial scans and the learnable shape priors, and the outputs are completed shapes. The specifications of the encoder and decoder blocks in these models are shown in Figure 8.

Figure 7: Model specifications of our method. (a), (b), and (c) show the patch-learning model structures for the three resolutions ($k = 4, 8, 32$); (d) shows the multi-resolution model structure. In (d), the attention maps, the input local features $f^{x,k}$, and the learned prior patch features $f^{\mathcal{P},k}$ are shown for each patch-learning resolution $k \in \{4, 8, 32\}$.
Figure 8: Layer specifications in our model. During patch prior learning at a single resolution, we use encoder blocks to encode partial input scans and learnable shape priors into local features, and then use linear blocks to post-process the obtained attention map. The decoder block is used for decoding complete shapes in the multi-resolution patch learning module.

Appendix F Additional Qualitative Results

Figure 9 shows more examples for qualitative results on ShapeNet, and Figure 10 shows more examples for qualitative results on ScanNet scans.

Figure 9: Qualitative comparisons with state of the art on ShapeNet Chang et al. (2015).
Figure 10: Qualitative comparisons with state of the art on ScanNet Dai et al. (2017a).