Joint Embedding of 3D Scan and CAD Objects

08/19/2019, by Manuel Dahnert, et al.

3D scan geometry and CAD models often contain complementary information about an environment, which could be leveraged by establishing a mapping between the two domains. However, this is a challenging task due to strong, low-level differences between scan and CAD geometry. We propose a novel 3D CNN-based approach to learn a joint embedding space between scan and CAD geometry, in which semantically similar objects from both domains lie close together. To learn a shared space where scan objects and CAD models can interlace, we propose a stacked hourglass approach that separates foreground from background in a scan object and transforms it into a complete, CAD-like representation before mapping it into the shared embedding space. This embedding space can then be used for CAD model retrieval; to further enable this task, we introduce a new dataset of ranked scan-CAD similarity annotations, enabling new, fine-grained evaluation of CAD model retrieval to cluttered, noisy, partial scans. Our learned joint embedding outperforms the current state of the art for CAD model retrieval by 12% in instance retrieval accuracy.


1 Introduction

The capture and reconstruction of real-world 3D scenes has seen significant progress in recent years, driven by the increasing availability of commodity RGB-D sensors such as the Microsoft Kinect or Intel RealSense. State-of-the-art 3D reconstruction approaches can achieve impressive reconstruction fidelity with robust tracking [19, 15, 21, 34, 7, 9]. Such 3D reconstructions have now begun to drive forward 3D scene understanding with the recent availability of annotated reconstruction datasets [8, 3]. With the simultaneous availability of synthetic CAD model datasets [4], we have an opportunity to drive forward both 3D scene understanding and geometric reconstruction.

3D models of scanned real-world objects as well as synthetic CAD models of shapes carry significant information for understanding environments, often in a complementary fashion. Where CAD models typically comprise relatively simple, clean, compact geometry, real-world objects are often more complex, and their scanned geometry is additionally noisy and incomplete. It is thus very informative to establish mappings between the two domains – for instance, to visually transform scans into CAD representations, or to transfer learned semantic knowledge from CAD models to a real-world scan. Such a semantic mapping is difficult to obtain due to the lack of exact matches between synthetic models and real-world objects and due to these strong, low-level geometric differences.

Current approaches to retrieving CAD models representative of scanned objects thus focus on retrieving a CAD model of the correct object class category [27, 8, 14, 25], without considering within-class similarities or rankings. In contrast, our approach learns a joint embedding space of scan and CAD object geometry where similar objects from both domains lie close together, as shown in Fig. 1. To this end, we introduce a new 3D CNN-based approach to learn a semantically mixed embedding space, as well as a dataset of 5102 ranked scan-CAD similarity annotations. Using this dataset of scan-CAD similarity, we can now fully evaluate CAD model retrieval, with benchmark evaluation of both retrieval accuracy and ranking ability. To learn the joint embedding space, our model takes a stacked hourglass approach with a series of encoder-decoders: first learning to disentangle a scan object from its background clutter, then transforming the partial scan object to a complete object geometry, and finally learning a shared embedding with CAD models through a triplet loss. This enables mapping scan and CAD object geometry into a shared space, and outperforms state-of-the-art CAD model retrieval approaches by 12% in instance retrieval accuracy.

In summary, we make the following contributions:

  • We propose a novel stacked hourglass approach leveraging a triplet loss to learn a joint embedding space between CAD models and scan object geometry.

  • We introduce a new dataset of ranked scan-CAD object similarities, establishing a benchmark for CAD model retrieval from an input scan object. For this task, we propose fine-grained evaluation scores for both retrieval and ranking.

2 Related Work

3D Shape Descriptors

Characterizations of 3D shapes by compact feature descriptors enable a variety of tasks in shape analysis such as shape matching, retrieval, or organization. Shape descriptors have thus seen a long history in geometry processing. Descriptors for characterizing 3D shapes have been proposed leveraging handcrafted features based on lower-level geometric characteristics such as volume, distance, or curvature [23, 22, 11, 28, 30], or higher-level characteristics such as topology [13, 5, 29]. Characterizations in the form of 2D projections of the 3D shapes have also been proposed to describe the appearance and geometry of a shape [6]. Recently, with advances in deep neural networks for 3D data, neural networks trained for point cloud or volumetric shape classification have also been used to provide feature descriptors for 3D shapes [27, 26].

CAD Model Retrieval for 3D Scans

CAD model retrieval to RGB-D scan data has been increasingly studied with the recent availability of large-scale datasets of real-world [8, 3] and synthetic [4] 3D objects. The SHREC challenges [14, 25] for CAD model retrieval to real-world scans of objects have become very popular in this context. Due to lack of ground truth data for similarity of CAD models to scan objects, CAD model retrieval in this context is commonly evaluated using the class categories as a coarse proxy for similarity; that is, a retrieved model is considered to be a correct retrieval if the category matches that of the query scan object. We propose a finer-grained evaluation for the task of CAD model retrieval for a scan object with our new Scan-CAD Object Similarity dataset and benchmark.

Multi-modal Embeddings

Embedding spaces across different data modalities have been used for various computer vision tasks, such as establishing relationships between image and language [32, 33], or learning similarity between different image domains such as photos and product images [2]. These cross-domain relationships have been shown to aid tasks such as object detection [24, 18]. More recently, Herzog et al. proposed an approach to relate 3D models, keywords, images, and sketches [12]. Li et al. introduced a CNN-based approach to learn a shared embedding space between CAD models and images, leveraging a CNN to map images into a pre-constructed feature space of CAD model similarity [16]. Our approach also leverages a CNN, but constructs a model which learns a joint embedding between scan objects and CAD models in an end-to-end fashion, learning to become invariant to differences in partialness and geometric noise.

Figure 2: Our network architecture to construct a joint embedding between scan and CAD object geometry. The architecture is designed in a stacked hourglass fashion, with a series of hourglass encoder-decoders transforming a scan input to a more CAD-like representation before mapping the features into an embedding space with a triplet loss. The first hourglass (blue) segments a scan object from its background clutter, the second hourglass (green) predicts the complete geometry for the segmented object, from which the final feature encoding is computed (yellow); CAD object features are computed with the same final encoder. Layers are annotated with their number of output channels, kernel size, stride, and padding. Lighter-colored layers denote residual blocks, darker-colored layers denote convolutional layers.

3 Method Overview

Our method learns a shared embedding space between real-world scans of objects and CAD models, in which semantically similar scan and CAD objects lie near each other, with objects from both domains mixed together, invariant to lower-level geometric differences (partialness, noise, etc.).

We represent both scan and CAD objects by binary grids representing voxel occupancy, and design a 3D convolutional neural network to encourage scan objects and CAD objects to map into a shared embedding space. Our model is thus structured in a stacked hourglass [20] fashion, designed to transform scan objects to a more CAD-like representation before mapping them into this joint space.

The first hourglass learns to segment the scan geometry into object and background clutter, using an encoder with two decoders trained to reconstruct foreground and background, respectively. The segmented foreground then feeds into the next hourglass, an encoder-decoder trained to reconstruct the complete geometry of the segmented but partial scan object. This helps to disentangle confounding factors like clutter and partialness of scanned objects before mapping them into a shared space with CAD objects. The completed scan is then input to a final encoder which produces a latent feature vector in the embedding space, constrained to lie close to the feature computed by a CAD encoder for a matching CAD object and far from that computed for a non-matching CAD object.

This enables learning of a joint embedding space where semantically similar CAD objects and scan objects lie mixed together. With this learned shared embedding space, we can enable applications such as much finer-grained CAD model retrieval to scan objects than previously attainable. To this end, we demonstrate our joint scan-CAD embedding in the context of CAD model retrieval, introducing a Scan-CAD Object Similarity benchmark and evaluation scores for this task.

4 Learning a Joint Scan-CAD Embedding

4.1 Network Architecture

Our network architecture is shown in Fig. 2. It is an end-to-end, fully-convolutional 3D neural network designed to disentangle lower-level geometric differences between scan objects and CAD models. During training, we take as input a scan object along with a corresponding (similar) CAD model and a dissimilar CAD model, each represented by its binary occupancy in a volumetric grid. At test time, we use the learned feature extractors for scan or CAD objects to compute a feature vector in the joint embedding space.

The model is composed as a stacked hourglass of two encoder-decoders followed by a final encoder. The first two hourglass components focus on transforming a scan object to a more CAD-like representation to encourage the joint embedding space to focus on higher-level semantic and structural similarities between scan and CAD than lower-level geometric differences.

The first hourglass is thus designed to segment a scan object from nearby background clutter (e.g., floor, wall, other objects), and is composed of an encoder and two decoders (one for the foreground scan object, one for the background). The encoder employs an initial convolution followed by a series of residual blocks and a final convolution layer producing the latent feature vector. This feature is then split in half; the first half feeds into a decoder which reconstructs the scan object segmented from its background, and the second half into a decoder which reconstructs the background clutter of the input scan geometry. The decoders are structured symmetrically to the encoder (each using half the feature channels). For the predicted scan object geometry and background geometry, we train with proxy losses for reconstructing the segmented scan object and the background clutter, respectively, as occupancy grids.
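For concreteness, the following is a minimal PyTorch sketch of this segmentation hourglass, assuming 32³ occupancy grids; the channel widths, latent dimension, and the use of binary cross-entropy as the proxy reconstruction loss are illustrative assumptions rather than the exact configuration of Fig. 2 (residual blocks are omitted for brevity).

```python
import torch
import torch.nn as nn

class SegmentationHourglass(nn.Module):
    """Sketch of the first hourglass: encode a cluttered 32^3 scan occupancy
    grid, split the latent code in half, and decode foreground (object) and
    background (clutter) occupancy separately."""

    def __init__(self, latent_dim=512):  # latent dimension is a placeholder
        super().__init__()
        # Encoder: strided 3D convolutions (residual blocks omitted for brevity).
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),   # 32^3 -> 16^3
            nn.Conv3d(16, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # 16^3 -> 8^3
            nn.Conv3d(32, latent_dim, kernel_size=8), nn.ReLU(),               # 8^3  -> 1^3
        )

        def make_decoder():
            # Decoder mirrors the encoder, operating on half of the latent code.
            return nn.Sequential(
                nn.ConvTranspose3d(latent_dim // 2, 32, kernel_size=8), nn.ReLU(),
                nn.ConvTranspose3d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose3d(16, 1, kernel_size=4, stride=2, padding=1),
            )

        self.decoder_fg = make_decoder()  # reconstructs the segmented scan object
        self.decoder_bg = make_decoder()  # reconstructs the background clutter

    def forward(self, scan_occupancy):
        z = self.encoder(scan_occupancy)       # (B, latent_dim, 1, 1, 1)
        z_fg, z_bg = torch.chunk(z, 2, dim=1)  # split the latent feature in half
        return self.decoder_fg(z_fg), self.decoder_bg(z_bg)

# Proxy reconstruction losses on binary occupancy grids (BCE is an assumption).
bce = nn.BCEWithLogitsLoss()

def segmentation_loss(pred_fg, pred_bg, gt_fg, gt_bg):
    return bce(pred_fg, gt_fg) + bce(pred_bg, gt_bg)
```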

The second hourglass takes the segmented scan object and aims to generate the complete geometry of the object, as real-world scans often yield only partially observed geometry. It is structured in encoder-decoder fashion, with the encoder and decoder structured symmetrically to the decoders of the first segmentation hourglass. We then employ a proxy loss on the completion as an occupancy grid, comparing the completion prediction to the CAD model corresponding to the scan object.

The final encoder aims to learn the joint scan-CAD embedding space. This is formulated as a triplet loss over the scan anchor, a matching CAD model, and a non-matching CAD model:

$$\mathcal{L}_{\text{triplet}} = \max\big( \lVert f_s - f_{c^+} \rVert_2 - \lVert f_s - f_{c^-} \rVert_2 + m,\; 0 \big)$$

where $f_s$ is the scan feature vector, produced by an encoder (structured symmetrically to the segmentation encoder) applied to the scan segmentation and completion; $f_{c^+}$ and $f_{c^-}$ are the feature vectors of the matching and non-matching CAD occupancy grids, computed by an identically structured CAD encoder; and $m$ is the margin. For all our experiments, all losses are weighted equally, and we use the Euclidean distance with a fixed margin.
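As a sketch, this triplet objective can be written in PyTorch as follows, assuming the scan and CAD feature vectors have already been computed by their respective encoders; the margin value shown is a placeholder, not necessarily the paper's setting.

```python
import torch
import torch.nn.functional as F

def joint_embedding_triplet_loss(f_scan, f_cad_pos, f_cad_neg, margin=0.1):
    """Triplet loss with Euclidean distance: pull the scan feature towards the
    matching CAD feature and push it away from the non-matching one."""
    d_pos = F.pairwise_distance(f_scan, f_cad_pos)  # distance to matching CAD model
    d_neg = F.pairwise_distance(f_scan, f_cad_neg)  # distance to non-matching CAD model
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

# PyTorch's built-in equivalent:
# loss_fn = torch.nn.TripletMarginLoss(margin=0.1, p=2)
# loss = loss_fn(f_scan, f_cad_pos, f_cad_neg)
```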

4.2 Network Training

We train our model end-to-end from scratch. For training data, we use the paired scan and CAD objects from Scan2CAD [1], which provides CAD model alignments from ShapeNet [4] onto the real-world scans of ScanNet [8]. For the non-matching CAD models, we randomly sample Scan2CAD models from different class categories, and re-sample new negatives after every epoch.

We train our model using the Adam optimizer with a batch size of 128; the initial learning rate is decayed at fixed intervals during training. Training takes approximately one day on a single Nvidia GTX 1080Ti.
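The per-epoch negative re-sampling can be sketched as follows; the container names are hypothetical stand-ins for the Scan2CAD training data.

```python
import random

def sample_triplets(scan_cad_pairs, cad_models_by_category):
    """Draw a non-matching (negative) CAD model from a different class category
    for each (scan, positive CAD, category) training pair. Both arguments are
    hypothetical containers standing in for the Scan2CAD training data."""
    triplets = []
    for scan, cad_pos, category in scan_cad_pairs:
        negative_category = random.choice(
            [c for c in cad_models_by_category if c != category])
        cad_neg = random.choice(cad_models_by_category[negative_category])
        triplets.append((scan, cad_pos, cad_neg))
    return triplets

# Negatives are redrawn before every pass over the data:
# for epoch in range(num_epochs):
#     triplets = sample_triplets(train_pairs, cad_models_by_category)
#     train_one_epoch(model, triplets, optimizer)
```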

Method trash bin bathtub bed bookshelf cabinet chair display file sofa table class avg inst (k=10) inst (k=50)
FPFH [28] 0.09 0.06 0.01 0.03 0.02 0.05 0.08 0.02 0.02 0.01 0.03 0.02 0.04
SHOT [30] 0.17 0.14 0.06 0.02 0.03 0.12 0.13 0.08 0.01 0.05 0.08 0.04 0.07
PointNet [26] 0.10 0.08 0.18 0.08 0.03 0.07 0.06 0.12 0.04 0.05 0.06 0.05 0.13
3DCNN [27] 0.29 0.31 0.32 0.31 0.21 0.14 0.29 0.28 0.29 0.18 0.22 0.20 0.33
Ours (no seg, no cmpl) 0.14 0.13 0.23 0.11 0.07 0.15 0.14 0.28 0.19 0.18 0.16 0.14 0.22
Ours (no cmpl) 0.24 0.32 0.26 0.28 0.13 0.21 0.44 0.24 0.19 0.25 0.24 0.21 0.31
Ours (no seg) 0.50 0.53 0.52 0.51 0.48 0.44 0.51 0.53 0.47 0.50 0.49 0.48 0.49
Ours (no triplet) 0.51 0.48 0.45 0.22 0.42 0.34 0.25 0.50 0.28 0.38 0.36 0.34 0.42
Ours (w/o end-to-end) 0.42 0.46 0.46 0.35 0.42 0.35 0.33 0.51 0.34 0.41 0.39 0.37 0.44
Ours 0.51 0.52 0.50 0.51 0.51 0.48 0.50 0.55 0.51 0.49 0.50 0.49 0.50
Table 1: Evaluation of the joint scan-CAD embedding space. We compare our learned scan-CAD feature space to feature spaces constructed from both handcrafted and learned shape descriptors. We evaluate the confusion between scan and CAD, where 0.5 reflects perfect confusion.

5 Scan-CAD Object Similarity Benchmark

Figure 3: Annotation interface for obtaining ranked similarity of CAD models to a scan query. A user selects and ranks up to 3 CAD models from a pool of 6 proposed models.

Our learned joint embedding space between scan and CAD object geometry enables characterization of these objects at higher-level semantic and structural similarity. This allows us to formulate applications like CAD model retrieval in a more comprehensive fashion, in particular in contrast to previous approaches which evaluate retrieval by the class accuracy of the retrieved object [14, 27, 25]. We aim to characterize retrieval through finer-grained object similarity than class categories. Thus, we propose a new Scan-CAD Object Similarity dataset and benchmark for CAD model retrieval.

To construct our Scan-CAD Object Similarity dataset, we develop an intuitive web-based annotation interface designed to measure scan-CAD similarities, inspired by [17]. As shown in Fig. 3, the geometry of a query scan object is shown along with a set of CAD models. A user then selects up to 3 similar CAD models from the proposed set, in order of similarity to the query scan geometry, resulting in ranked scan-CAD similarity annotations. Users are instructed to measure similarity in terms of object geometry. Initially, the models are displayed in a canonical pose, but the user can rotate, translate, or zoom each model individually to inspect it in closer detail. As scan objects can occasionally be very partial, we also provide a 'hint' option which shows a color image of the object with a bounding box around it, in order to help identify the object if the segmented geometry alone is insufficient.

To collect these scan-CAD similarity annotations, we use segmented scan objects from the ScanNet dataset [8], which provides labeled semantic instance segmentation over the scan geometry. CAD models are proposed from ShapeNetCore [4]. The CAD model proposals are sampled leveraging the annotations from the Scan2CAD dataset [1], which provides alignments of unique ShapeNetCore models to objects in ScanNet scans. We propose CAD models for a scan query by sampling in the latent space of an autoencoder trained on ShapeNetCore, using the feature vector of the associated CAD model from the Scan2CAD dataset. In this latent space, we select the 30 nearest neighbors of the associated CAD model and randomly select 6 of them to be proposed to the user. This yields ranked similarity annotations of a scan object to several CAD models, which we can then use for fine-grained evaluation of CAD model retrieval.
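A sketch of this proposal sampling, assuming the autoencoder latent codes for all candidate CAD models are available as a matrix and that Euclidean distance is used in the latent space:

```python
import numpy as np

def propose_cad_models(query_code, cad_codes, num_neighbors=30, num_proposals=6, rng=None):
    """Select the 30 nearest CAD models to the query's associated CAD model in
    the autoencoder latent space, then draw 6 of them at random to show to the
    annotator. `cad_codes` is an (N, D) array of latent codes for all CAD models."""
    if rng is None:
        rng = np.random.default_rng()
    distances = np.linalg.norm(cad_codes - query_code, axis=1)     # Euclidean distance (assumed)
    nearest = np.argsort(distances)[:num_neighbors]                # indices of the 30 nearest models
    return rng.choice(nearest, size=num_proposals, replace=False)  # 6 random proposals
```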

Dataset Statistics

To construct our Scan-CAD Object Similarity dataset and benchmark, we employed three university students as annotators, and trained them to become familiar with the interface and to ensure high-quality annotations for our task. Our final dataset is composed of 5102 annotations covering 31 different class categories (derived from ShapeNet classes). These cover 3979 unique scan objects and 7650 unique CAD models.

Method trash bin bathtub bed bookshelf cabinet chair display file sofa table other class avg inst avg
FPFH [28] 0.02 0.07 0.00 0.00 0.00 0.18 0.03 0.00 0.07 0.02 0.03 0.04 0.08
SHOT [30] 0.00 0.20 0.09 0.00 0.01 0.06 0.12 0.00 0.07 0.02 0.03 0.05 0.04
PointNet [26] 0.38 0.00 0.61 0.23 0.04 0.43 0.37 0.17 0.09 0.13 0.07 0.23 0.29
3DCNN [27] 0.52 0.33 0.48 0.46 0.14 0.28 0.38 0.33 0.17 0.18 0.32 0.33 0.31
Ours (no seg, no cmpl) 0.06 0.00 0.15 0.04 0.00 0.47 0.30 0.00 0.20 0.13 0.04 0.13 0.23
Ours (no cmpl) 0.13 0.07 0.15 0.12 0.04 0.37 0.38 0.00 0.15 0.26 0.09 0.16 0.24
Ours (no seg) 0.14 0.07 0.24 0.13 0.15 0.40 0.32 0.17 0.15 0.21 0.13 0.19 0.26
Ours (no triplet) 0.03 0.13 0.39 0.04 0.11 0.07 0.08 0.00 0.13 0.09 0.04 0.10 0.08
Ours (w/o end-to-end) 0.42 0.27 0.48 0.07 0.15 0.42 0.27 0.25 0.35 0.21 0.32 0.29 0.32
Ours 0.50 0.60 0.42 0.19 0.26 0.55 0.45 0.25 0.33 0.32 0.43 0.39 0.43
Table 2: Top-1 retrieval accuracy for CAD model retrieval on the test split of our Scan-CAD Object Similarity benchmark.

5.1 Benchmark Evaluation

We also introduce a new benchmark to evaluate both a scan-CAD embedding space and CAD model retrieval. To evaluate the learned embedding space, we measure a confusion score: for each object's embedding feature, we compute the percentage of scan neighbors and the percentage of CAD neighbors among its k nearest neighbors, from which the final confusion score is computed. This describes how well the embedding space mixes the two domains, agnostic to lower-level geometric differences. Note that we evaluate this confusion score on a set of embedded scan and CAD features with a 1-to-1 mapping between the scan and CAD objects, and report results for k = 10 and k = 50. A confusion of 0.5 means a perfect balance between scan and CAD objects in the local neighborhood around an object.

To evaluate the semantic embedding quality, we propose two scores for scan-CAD similarity in the context of CAD model retrieval: retrieval accuracy and ranking quality. Here, we employ the scan-CAD similarity annotations of our Scan-CAD Object Similarity dataset. For both retrieval accuracy and ranking quality, we consider an input query scan and retrieval from the set of proposed CAD models, supplemented with additional randomly selected CAD models of a different class from the query (in order to reflect a diverse set of models for retrieval). For retrieval accuracy, we evaluate whether the top-ranked retrieved model lies in the set of models annotated as similar to the query scan. We also evaluate the ranking; that is, for a ground-truth annotation with n rank-annotated similar models (n ≤ 3), we take the top n predicted models and evaluate the number of models predicted at the correct rank, divided by n.
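The two scores can be computed roughly as follows; this is our reading of the definitions above, not a reference implementation of the benchmark.

```python
def retrieval_accuracy(retrieved, annotated_similar):
    """1 if the top-ranked retrieved CAD model is among the models annotated as
    similar to the query scan, else 0; averaged over queries for the benchmark."""
    return 1.0 if retrieved[0] in annotated_similar else 0.0

def ranking_quality(retrieved, annotated_ranking):
    """Fraction of the top-n retrieved models that appear at their annotated rank,
    where n is the number of rank-annotated similar models (at most 3)."""
    n = len(annotated_ranking)
    correct = sum(1 for i in range(n)
                  if i < len(retrieved) and retrieved[i] == annotated_ranking[i])
    return correct / n
```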

Note that for the task of CAD model retrieval, we consider scan objects in the context of potential background clutter from scanning; that is, we assume a given object detection as input, but not an object segmentation.

Figure 4: Our CAD model retrieval results, visualizing the top retrieved models using our joint embedding space for various scan and CAD queries. Our feature space learns to mix together scan and CAD objects in a semantically meaningful fashion.
Figure 5: CAD model retrieval results (top-1) for various scan queries (from left to right: piano, table, guitar, trash bin, bed, lamp, dresser). Our approach to a joint embedding of scan and CAD can retrieve similar models at a finer-grained level than state-of-the-art handcrafted (FPFH [28], SHOT [30]) and learned (PointNet [26], 3DCNN [27]) 3D object descriptors.

6 Results and Evaluation

We evaluate both the quality of our learned scan-CAD embedding space and its application to the task of CAD model retrieval for scan objects, using the confusion, retrieval accuracy, and ranking quality scores proposed in Section 5.1. Additionally, in Table 3, we evaluate a coarser retrieval score based only on whether the retrieved model's class is correct, which is the basis of the retrieval evaluation used in previous approaches [27, 14, 25]. We compare our method with both state-of-the-art handcrafted shape descriptors, FPFH [28] and SHOT [30], as well as learned shape descriptors from PointNet [26] and the volumetric 3D CNN of [27]. We evaluate FPFH and SHOT on point clouds uniformly sampled from the mesh surfaces of the scan and CAD objects, with all meshes normalized to lie within a unit sphere. We compute a single shape descriptor for the entire object by using the centroid of the mesh and a radius of 1.

We train PointNet on points uniformly sampled from the scan and CAD objects for object classification, and extract the feature vector before the final classification layer. For the volumetric 3D CNN of [27], we train on occupancy grids of both scan objects and CAD models, and likewise extract the feature vector before the final classification layer.
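As an illustration of the baseline setup, a per-object FPFH descriptor can be approximated with a recent version of Open3D as sketched below; the normal-estimation radius and the averaging of per-point histograms into a single vector are our simplifications of the centroid-based single-descriptor protocol described above.

```python
import numpy as np
import open3d as o3d  # assumes a recent Open3D (>= 0.10) with the pipelines module

def global_fpfh_descriptor(mesh_path, num_points=2048):
    """Sample points from a mesh, normalize to the unit sphere, and compute FPFH
    with a radius of 1 so the support covers the whole object. Averaging the
    per-point 33-dim histograms into one vector is a simplification of the
    centroid-based single-descriptor protocol."""
    mesh = o3d.io.read_triangle_mesh(mesh_path)
    pcd = mesh.sample_points_uniformly(number_of_points=num_points)
    pts = np.asarray(pcd.points)
    pts = pts - pts.mean(axis=0)                 # center at the centroid
    pts = pts / np.max(np.linalg.norm(pts, axis=1))  # scale into the unit sphere
    pcd.points = o3d.utility.Vector3dVector(pts)
    pcd.estimate_normals(o3d.geometry.KDTreeSearchParamHybrid(radius=0.2, max_nn=30))
    fpfh = o3d.pipelines.registration.compute_fpfh_feature(
        pcd, o3d.geometry.KDTreeSearchParamHybrid(radius=1.0, max_nn=num_points))
    return np.asarray(fpfh.data).mean(axis=1)    # single 33-dim descriptor
```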

Learned joint embedding space.

In Table 1, we show that our model is capable of learning a well-mixed embedding space, where scan and CAD objects lie about as close to each other as they do to other objects from the same domain, while maintaining semantic structure in the space. In contrast, both previous handcrafted and learned shape descriptors result in segregated feature spaces, with scan objects lying much closer to other scan objects than to CAD objects and vice versa; see Fig. 6. Our learned scan-CAD embedding space is shown in Fig. 1, visualized by t-SNE. We also show the top-4 nearest neighbors for various queries in our joint embedding space in Fig. 4, retrieving objects from both domains while maintaining semantic structure.

Comparison to alternative CAD model retrieval approaches.

We evaluate our learned embedding space for scan and CAD objects on the task of CAD model retrieval to scan object geometry. Tables 2 and 4 show our CAD retrieval quality in comparison to alternative 3D object descriptors, using our benchmark evaluation. Fig. 5 shows the top-1 CAD retrievals for various scan queries. Our learned features from the joint embedding space achieve notably improved retrieval on both the class accuracy-based retrieval score (Table 3) and our proposed finer-grained retrieval evaluation scores.

Method Top-1 Top-5
FPFH [28] 0.14 0.13
SHOT [30] 0.07 0.08
PointNet [26] 0.49 0.45
3DCNN [27] 0.57 0.47
Ours 0.68 0.62
Table 3: Evaluation of CAD model retrieval by Top-1 and Top-5 using category-based evaluation of retrieval accuracy.

How much do the segmentation and completion steps matter?

Tables 1, 2, and 4 show that the proxy segmentation and completion steps, which transform scan object geometry to a more CAD-like representation, are important both for learning an effective joint embedding space and for CAD model retrieval: class-average retrieval accuracy improves substantially when segmentation and completion are added (compare the 'no seg' and 'no cmpl' ablations to the full model in Table 2). Additionally, we show that end-to-end training significantly improves the learned embedding space.

Method trash bin bathtub bed bookshelf cabinet chair display file sofa table other class avg inst avg
FPFH [28] 0.01 0.09 0.01 0.00 0.00 0.06 0.01 0.00 0.03 0.01 0.02 0.02 0.03
SHOT [30] 0.00 0.06 0.01 0.00 0.01 0.03 0.02 0.00 0.04 0.01 0.01 0.02 0.02
PointNet [26] 0.22 0.03 0.24 0.15 0.04 0.16 0.11 0.00 0.02 0.04 0.05 0.10 0.12
3DCNN [27] 0.23 0.03 0.31 0.16 0.07 0.11 0.12 0.13 0.09 0.07 0.12 0.12 0.13
Ours (no seg, no cmpl) 0.05 0.00 0.08 0.03 0.01 0.17 0.14 0.00 0.10 0.04 0.04 0.06 0.10
Ours (no cmpl) 0.08 0.00 0.06 0.04 0.02 0.15 0.12 0.06 0.06 0.11 0.05 0.07 0.10
Ours (no seg) 0.08 0.06 0.12 0.08 0.09 0.14 0.09 0.06 0.07 0.07 0.04 0.08 0.10
Ours (no triplet) 0.01 0.06 0.13 0.03 0.04 0.03 0.02 0.06 0.04 0.04 0.05 0.05 0.04
Ours (w/o end-to-end) 0.14 0.18 0.12 0.04 0.06 0.18 0.14 0.13 0.16 0.08 0.12 0.12 0.13
Ours 0.29 0.24 0.19 0.08 0.12 0.19 0.14 0.19 0.15 0.10 0.09 0.16 0.16
Table 4: Ranking quality of CAD model retrieval on the test split of our Scan-CAD Object Similarity benchmark.

What is the impact of the triplet loss formulation?

Using a triplet loss to train the feature embedding in a shared space significantly improves the construction of the embedding space, as well as CAD model retrieval from the space. In Tables 1, 2, and 4, we show a comparison to training our model using only positive scan-CAD associations rather than both positive and negative samples; the triplet constraint of both positive and negative examples produces a much more globally structured embedding space.

How robust is the model to rotations?

To achieve robustness to rotations for scan queries, we can train our method with rotation augmentation, achieving similar performance for arbitrarily rotated scan inputs (0.42 instance average retrieval accuracy, 0.16 instance average ranking quality). See the appendix for more detail.

Figure 6: Comparison of latent spaces visualized by t-SNE. Filled triangles represent scan objects, circles represent CAD models. While FPFH, SHOT, and PointNet result in almost entirely disjoint clusters, 3DCNN is able to co-locate the classes of both domains next to each other, but does not confuse them. Our approach learns an embedding space where scan and CAD objects mix together but remain semantically structured.

6.1 Limitations

While our approach learns an effective embedding space between scan and CAD object geometry, there are still several important limitations. For instance, we only consider the geometry of the objects in both scan and CAD domain; considering color information would potentially be another powerful signal for joint embedding or CAD model retrieval. The geometry is also represented as an occupancy grid, which can limit resolution of fine detail. For the CAD model retrieval task, we currently assume a given object detection, and while 3D object detection has recently made significant progress, detection and retrieval would likely benefit from an end-to-end formulation.

7 Conclusion

In this paper, we have presented a 3D CNN-based approach to jointly map scan and CAD object geometry into a shared embedding space. Our approach leverages a stacked hourglass architecture combined with a triplet loss to transform scan object geometry to a more CAD-like representation, effectively learning a joint feature embedding space. We show the advantages of our learned feature space for the task of CAD model retrieval, and propose several new evaluation scores for finer-grained retrieval evaluation, with our approach outperforming state-of-the-art handcrafted and learned methods on all evaluation scores. We hope that learning such a joint scan-CAD embedding space will not only open new possibilities for CAD model retrieval, but also enable new perspectives on mapping and reciprocal transfer of knowledge between the two domains.

Acknowledgements

This work is supported by Occipital, the ERC Starting Grant Scan2CAD (804724), a Google Faculty Award, the ZD.B., ONR MURI grant N00014-13-1-0341, NSF grant IIS-1763268, as well as a Vannevar Bush Fellowship. We would also like to thank the support of the TUM-IAS, funded by the German Excellence Initiative and the European Union Seventh Framework Programme under grant agreement n° 291763, for the TUM-IAS Rudolf Mößbauer Fellowship and Hans Fischer Senior Faculty Fellowship (Focus Group Visual Computing).

References

  • [1] A. Avetisyan, M. Dahnert, A. Dai, M. Savva, A. X. Chang, and M. Nießner (2019) Scan2CAD: learning cad model alignment in rgb-d scans. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE. Cited by: §4.2, §5.
  • [2] S. Bell and K. Bala (2015) Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG) 34 (4), pp. 98. Cited by: §2.
  • [3] A. X. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang (2017) Matterport3D: learning from RGB-D data in indoor environments. International Conference on 3D Vision (3DV). Cited by: §1, §2.
  • [4] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu (2015) ShapeNet: An Information-Rich 3D Model Repository. Technical report Technical Report arXiv:1512.03012 [cs.GR], Stanford University — Princeton University — Toyota Technological Institute at Chicago. Cited by: §1, §2, §4.2, §5.
  • [5] D. Chen and M. Ouhyoung (2002) A 3d object retrieval system based on multi-resolution reeb graph. In Proc. of Computer Graphics Workshop, Vol. 16. Cited by: §2.
  • [6] D. Chen, X. Tian, Y. Shen, and M. Ouhyoung (2003) On visual similarity based 3d model retrieval. In Computer graphics forum, Vol. 22, pp. 223–232. Cited by: §2.
  • [7] S. Choi, Q. Zhou, and V. Koltun (2015) Robust reconstruction of indoor scenes. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5556–5565. Cited by: §1.
  • [8] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, Cited by: §1, §1, §2, §4.2, §5.
  • [9] A. Dai, M. Nießner, M. Zollhöfer, S. Izadi, and C. Theobalt (2017) BundleFusion: real-time globally consistent 3d reconstruction using on-the-fly surface reintegration. ACM Transactions on Graphics (TOG) 36 (3), pp. 24. Cited by: §1.
  • [10] A. Dai, C. R. Qi, and M. Nießner (2017) Shape completion using 3d-encoder-predictor cnns and shape synthesis. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, Cited by: Appendix A, Table 5, Table 7, Table 8, Table 9.
  • [11] R. Gal, A. Shamir, and D. Cohen-Or (2007) Pose-oblivious shape signature. IEEE transactions on visualization and computer graphics 13 (2), pp. 261–271. Cited by: §2.
  • [12] R. Herzog, D. Mewes, M. Wand, L. Guibas, and H. Seidel (2015) LeSSS: learned shared semantic spaces for relating multi-modal representations of 3d shapes. In Computer Graphics Forum, Vol. 34, pp. 141–151. Cited by: §2.
  • [13] M. Hilaga, Y. Shinagawa, T. Kohmura, and T. L. Kunii (2001) Topology matching for fully automatic similarity estimation of 3d shapes. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pp. 203–212. Cited by: §2.
  • [14] B. Hua, Q. Truong, M. Tran, Q. Pham, A. Kanezaki, T. Lee, H. Chiang, W. Hsu, B. Li, Y. Lu, et al. SHREC’17: rgb-d to cad retrieval with objectnn dataset. Cited by: §1, §2, §5, §6.
  • [15] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, et al. (2011) KinectFusion: real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pp. 559–568. Cited by: §1.
  • [16] Y. Li, H. Su, C. R. Qi, N. Fish, D. Cohen-Or, and L. J. Guibas (2015) Joint embeddings of shapes and images via cnn image purification. ACM Trans. Graph.. Cited by: §2.
  • [17] T. Liu, A. Hertzmann, W. Li, and T. Funkhouser (2015-08) Style compatibility for 3D furniture models. ACM Transactions on Graphics (Proc. SIGGRAPH) 34 (4). Cited by: §5.
  • [18] F. Massa, B. C. Russell, and M. Aubry (2016) Deep exemplar 2d-3d detection by adapting from real to rendered views. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6024–6033. Cited by: §2.
  • [19] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohi, J. Shotton, S. Hodges, and A. Fitzgibbon (2011) KinectFusion: real-time dense surface mapping and tracking. In Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on, pp. 127–136. Cited by: §1.
  • [20] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In Computer Vision – ECCV 2016, pp. 483–499. Cited by: §3.
  • [21] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger (2013) Real-time 3d reconstruction at scale using voxel hashing. ACM Transactions on Graphics (TOG). Cited by: §1.
  • [22] R. Ohbuchi, T. Minamitani, and T. Takei (2003) Shape-similarity search of 3d models by using enhanced shape functions. In Proceedings of Theory and Practice of Computer Graphics, 2003., pp. 97–104. Cited by: §2.
  • [23] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin (2002) Shape distributions. ACM Transactions on Graphics (TOG) 21 (4), pp. 807–832. Cited by: §2.
  • [24] X. Peng, B. Sun, K. Ali, and K. Saenko (2015) Learning deep object detectors from 3d models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1278–1286. Cited by: §2.
  • [25] Q. Pham, M. Tran, W. Li, S. Xiang, H. Zhou, W. Nie, A. Liu, Y. Su, M. Tran, N. Bui, et al. SHREC’18: rgb-d object-to-cad retrieval. Cited by: §1, §2, §5, §6.
  • [26] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE 1 (2), pp. 4. Cited by: §2, Table 1, Figure 5, Table 2, Table 3, Table 4, §6.
  • [27] C. R. Qi, H. Su, M. Nießner, A. Dai, M. Yan, and L. Guibas (2016) Volumetric and multi-view cnns for object classification on 3d data. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, Cited by: §1, §2, Table 1, Figure 5, Table 2, §5, Table 3, Table 4, §6, §6.
  • [28] R. B. Rusu, N. Blodow, and M. Beetz (2009) Fast point feature histograms (fpfh) for 3d registration. In Robotics and Automation, 2009. ICRA’09. IEEE International Conference on, pp. 3212–3217. Cited by: §2, Table 1, Figure 5, Table 2, Table 3, Table 4, §6.
  • [29] H. Sundar, D. Silver, N. Gagvani, and S. Dickinson (2003) Skeleton based shape matching and retrieval. In 2003 Shape Modeling International., pp. 130–139. Cited by: §2.
  • [30] F. Tombari, S. Salti, and L. Di Stefano (2010) Unique signatures of histograms for local surface description. In Computer Vision – ECCV 2010, K. Daniilidis, P. Maragos, and N. Paragios (Eds.), Berlin, Heidelberg, pp. 356–369. External Links: ISBN 978-3-642-15558-1 Cited by: §2, Table 1, Figure 5, Table 2, Table 3, Table 4, §6.
  • [31] W. Wang, R. Yu, Q. Huang, and U. Neumann (2018) Sgpn: similarity group proposal network for 3d point cloud instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2569–2578. Cited by: Appendix A, Table 5.
  • [32] J. Weston, S. Bengio, and N. Usunier (2010) Large scale image annotation: learning to rank with joint word-image embeddings. Machine learning 81 (1), pp. 21–35. Cited by: §2.
  • [33] J. Weston, S. Bengio, and N. Usunier (2011) Wsabie: scaling up to large vocabulary image annotation. In Twenty-Second International Joint Conference on Artificial Intelligence. Cited by: §2.
  • [34] T. Whelan, S. Leutenegger, R. F. Salas-Moreno, B. Glocker, and A. J. Davison (2015) ElasticFusion: dense slam without a pose graph. Proc. Robotics: Science and Systems, Rome, Italy. Cited by: §1.

Appendix A Additional Quantitative Studies

We provide several additional quantitative experiments evaluating robustness against rotation, as well as the performance of the object segmentation and completion from our stacked hourglass model.

Robustness to rotations

To achieve robustness to potential rotations of input scan queries, we train our method with rotation augmentation around the up axis. During training, we rotate the initial partial and cluttered scan object as well as the positive and negative CAD models with the same random rotation. During evaluation of CAD model retrieval, we embed each CAD model into the embedding space under uniformly sampled rotations; for an input scan query, we then find the closest CAD embedding. We train for 160k iterations with a triplet margin of 0.1. Table 6 shows the results of CAD model retrieval when testing on randomly rotated scan object inputs. With this rotation augmentation, we achieve performance on par with the case of canonically oriented objects while testing on arbitrarily rotated scan inputs: 0.42 instance-average retrieval accuracy and 0.16 instance-average ranking quality.
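A sketch of this augmentation on occupancy grids, using nearest-neighbor resampling so the grids stay binary; the axis convention is an assumption and should be adapted to the actual grid layout.

```python
import numpy as np
from scipy.ndimage import rotate

def rotate_triplet(scan_vox, cad_pos_vox, cad_neg_vox, rng=None):
    """Apply one shared random rotation around the up axis to the scan grid and
    to the positive/negative CAD occupancy grids. axes=(0, 2) assumes the grids
    are ordered (x, y-up, z); adapt to the actual grid convention."""
    if rng is None:
        rng = np.random.default_rng()
    angle = rng.uniform(0.0, 360.0)

    def rot(vox):
        # order=0 (nearest neighbor) keeps the grid binary; reshape=False keeps its size.
        return rotate(vox, angle, axes=(0, 2), reshape=False, order=0)

    return rot(scan_vox), rot(cad_pos_vox), rot(cad_neg_vox)
```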

Method IoU
[A] SGPN [31] 0.10
[B] Segmentation(Ours) 0.36
[C] Segmentation(Ours) + 3D-EPN [10] 0.48
[D] Segmentation(Ours) + Completion(Ours) 0.53
Table 5: Evaluation (IoU) of our segmentation and completion to SGPN [31] and 3D-EPN [10], respectively.
Method trash bin bathtub bed bookshelf cabinet chair display file sofa table other class avg inst avg
Ours with rotations 0.35 0.47 0.45 0.18 0.30 0.56 0.45 0.08 0.41 0.41 0.26 0.36 0.42
Table 6: Evaluation of CAD model retrieval by retrieval accuracy on our Scan-CAD Object Similarity benchmark.

What is the performance of the segmentation and completion?

In Table 5, we evaluate the performance of the first and second hourglass with Intersection over Union (IoU) between the predicted and ground truth binary occupancy grid. We compare our model against SGPN [31], a point cloud based segmentation method, and 3D-EPN [10], a voxel-based object completion network. For evaluation, we then convert all outputs to occupancy grids to compute the final IoU scores.
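For reference, the IoU between a predicted and a ground-truth occupancy grid is computed in the standard way:

```python
import numpy as np

def occupancy_iou(pred, target, threshold=0.5):
    """IoU between a predicted occupancy grid (probabilities in [0, 1]) and a
    binary ground-truth occupancy grid."""
    pred_occ = pred > threshold
    target_occ = target > 0.5
    union = np.logical_or(pred_occ, target_occ).sum()
    if union == 0:
        return 1.0  # both grids empty
    return np.logical_and(pred_occ, target_occ).sum() / union
```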

Additionally, we evaluate our stacked hourglass model, replacing our completion encoder-decoder with the model of 3D-EPN, trained end-to-end to learn a joint scan-CAD embedding. Tables 7, 8, and 9 show that our model achieves notably better performance in embedding space confusion as well as CAD model retrieval and ranking than the version using 3D-EPN.

Method trash bin bathtub bed bookshelf cabinet chair display file sofa table class avg inst (k=10) inst (k=50)
Ours (no seg, no cmpl) 0.14 0.13 0.23 0.11 0.07 0.15 0.14 0.28 0.19 0.18 0.16 0.14 0.22
Ours (no cmpl) 0.24 0.32 0.26 0.28 0.13 0.21 0.44 0.24 0.19 0.25 0.24 0.21 0.31
Ours (no seg) 0.50 0.53 0.52 0.51 0.48 0.44 0.51 0.53 0.47 0.50 0.49 0.48 0.49
Ours (no triplet) 0.51 0.48 0.45 0.22 0.42 0.34 0.25 0.50 0.28 0.38 0.36 0.34 0.42
Ours (3D-EPN [10] for cmpl) 0.45 0.51 0.49 0.46 0.50 0.33 0.40 0.53 0.47 0.45 0.43 0.42 0.47
Ours (w/o end-to-end) 0.42 0.46 0.46 0.35 0.42 0.35 0.33 0.51 0.34 0.41 0.39 0.37 0.44
Ours 0.51 0.52 0.50 0.51 0.51 0.48 0.50 0.55 0.51 0.49 0.50 0.49 0.50
Table 7: Evaluation of the joint scan-CAD embedding space, comparing our full model to ablated variants, including one using 3D-EPN [10] for completion. We evaluate the confusion between scan and CAD, where 0.5 reflects perfect confusion.
Method trash bin bathtub bed bookshelf cabinet chair display file sofa table other class avg inst avg
Ours (no seg, no cmpl) 0.06 0.00 0.15 0.04 0.00 0.47 0.30 0.00 0.20 0.13 0.04 0.13 0.23
Ours (no cmpl) 0.13 0.07 0.15 0.12 0.04 0.37 0.38 0.00 0.15 0.26 0.09 0.16 0.24
Ours (no seg) 0.14 0.07 0.24 0.13 0.15 0.40 0.32 0.17 0.15 0.21 0.13 0.19 0.26
Ours (no triplet) 0.03 0.13 0.39 0.04 0.11 0.07 0.08 0.00 0.13 0.09 0.04 0.10 0.08
Ours (3D-EPN [10] for cmpl) 0.41 0.33 0.42 0.21 0.19 0.49 0.40 0.08 0.20 0.30 0.31 0.30 0.37
Ours (w/o end-to-end) 0.42 0.27 0.48 0.07 0.15 0.42 0.27 0.25 0.35 0.21 0.32 0.29 0.32
Ours 0.50 0.60 0.42 0.19 0.26 0.55 0.45 0.25 0.33 0.32 0.43 0.39 0.43
Table 8: Evaluation of CAD model retrieval by top-1 retrieval accuracy on the test split of our Scan-CAD Object Similarity benchmark.
Method trash bin bathtub bed bookshelf cabinet chair display file sofa table other class avg inst avg
Ours (no seg, no cmpl) 0.05 0.00 0.08 0.03 0.01 0.17 0.14 0.00 0.10 0.04 0.04 0.06 0.10
Ours (no cmpl) 0.08 0.00 0.06 0.04 0.02 0.15 0.12 0.06 0.06 0.11 0.05 0.07 0.10
Ours (no seg) 0.08 0.06 0.12 0.08 0.09 0.14 0.09 0.06 0.07 0.07 0.04 0.08 0.10
Ours (no triplet) 0.01 0.06 0.13 0.03 0.04 0.03 0.02 0.06 0.04 0.04 0.05 0.05 0.04
Ours (3D-EPN [10] for cmpl) 0.17 0.18 0.19 0.09 0.12 0.17 0.15 0.06 0.10 0.11 0.09 0.13 0.14
Ours (w/o end-to-end) 0.14 0.18 0.12 0.04 0.06 0.18 0.14 0.13 0.16 0.08 0.12 0.12 0.13
Ours 0.29 0.24 0.19 0.08 0.12 0.19 0.14 0.19 0.15 0.10 0.09 0.16 0.16
Table 9: Evaluation of CAD model retrieval by ranking quality on the test split of our Scan-CAD Object Similarity benchmark.