Learning Implicit Functions for Topology-Varying Dense 3D Shape Correspondence

10/23/2020 ∙ by Feng Liu, et al. ∙ Michigan State University

The goal of this paper is to learn dense 3D shape correspondence for topology-varying objects in an unsupervised manner. Conventional implicit functions estimate the occupancy of a 3D point given a shape latent code. Instead, our novel implicit function produces a part embedding vector for each 3D point, which is assumed to be similar to its densely corresponded point in another 3D shape of the same object category. Furthermore, we implement dense correspondence through an inverse function mapping from the part embedding to a corresponded 3D point. Both functions are jointly learned with several effective loss functions to realize our assumption, together with the encoder generating the shape latent code. During inference, if a user selects an arbitrary point on the source shape, our algorithm can automatically generate a confidence score indicating whether there is a correspondence on the target shape, as well as the corresponding semantic point if there is one. Such a mechanism inherently benefits man-made objects with different part constitutions. The effectiveness of our approach is demonstrated through unsupervised 3D semantic correspondence and shape segmentation.




1 Introduction

Finding dense correspondence between 3D shapes is a key algorithmic component in problems such as statistical modeling Blanz and Vetter (2003); Zuffi et al. (2017); Bogo et al. (2014), cross-shape texture mapping Kraevoy et al. (2003), and space-time 4D reconstruction Niemeyer et al. (2019). Dense 3D shape correspondence can be defined as: given two 3D shapes belonging to the same object category, one can match an arbitrary point on one shape to its semantically equivalent point on another shape if such a correspondence exists. For instance, given two chairs, the dense correspondence of the middle point on one chair's arm should be the similar middle point on the other chair's arm, despite different shapes of arms; alternatively, the method should declare the non-existence of correspondence if the other chair has no arm. Although prior dense correspondence methods Ovsjanikov et al. (2012); Litany et al. (2017); Groueix et al. (2018a); Halimi et al. (2019); Roufosse et al. (2019); Lee and Kazhdan (2019); Steinke et al. (2007); Liu et al. (2019a) have proven to be effective on organic shapes, e.g., human bodies and mammals, they become less suitable for generic topology-varying or man-made objects, e.g., chairs or vehicles Huang et al. (2014).

It remains a challenge to build dense 3D correspondence for a category with large variations in geometry, structure, and even topology. First of all, the lack of annotations on dense correspondence often leaves unsupervised learning the only option. Second, most prior works make an inadequate assumption Van Kaick et al. (2011) that there is a similar topological variability between matched shapes. Man-made objects such as the chairs shown in Fig. 1 are particularly challenging to tackle, since they often differ not only by geometric deformations, but also by part constitutions. In these cases, existing correspondence methods for man-made objects either perform fuzzy Kim et al. (2012); Solomon et al. (2012) or part-level Sidi et al. (2011); Alhashim et al. (2015) correspondences, or predict a constant number of semantic points Huang et al. (2017); Chen et al. (2020). As a result, they cannot determine whether the established correspondence is a "missing match" or not. As shown in Fig. 1, for instance, we may find non-convincing correspondences in legs between an office chair and a 4-legged chair, or even no correspondences in arms for some pairs. Ideally, given a query point on the source shape, a dense correspondence method aims to determine whether there exists a correspondence on the target shape, and the corresponding point if there is one. This objective lies at the core of this work.

Shape representation is highly relevant to, and can impact, the approach of dense correspondence. Recently, compared to point cloud Achlioptas et al. (2018); Qi et al. (2017a, b) or mesh Groueix et al. (2018b); Georgia Gkioxari (2019); Wang et al. (2018), deep implicit functions have been shown to be highly effective as 3D shape representations Park et al. (2019); Mescheder et al. (2019); Liu et al. (2019b); Saito et al. (2019); Chen et al. (2019); Chen and Zhang (2019); Atzmon et al. (2019), since they can handle generic shapes of arbitrary topology, which is favorable as a representation for dense correspondence. Often learned as an MLP, conventional implicit functions take as input the 3D shape, represented by a latent code, and a query location in the 3D space, and estimate the occupancy of that location.

In this work, we propose to plant the dense correspondence capability into the implicit function by learning a semantic part embedding. Specifically, we first adopt a branched implicit function Chen et al. (2019) to learn a part embedding vector (PEV), where the max-pooling of the PEV gives the occupancy. In this way, each branch is tasked to learn a representation for one universal part of the input shape, and the PEV represents the occupancy of the point w.r.t. all the branches/semantic parts. By assuming that PEVs between a pair of corresponding points are similar, we then establish dense correspondence via an inverse function mapping the PEV back to the 3D space. To further satisfy the assumption, we devise an unsupervised learning framework with a joint loss measuring both the occupancy error and the shape reconstruction error between the input shape and its reconstruction. In addition, a cross-reconstruction loss is proposed to enforce part embedding consistency by mapping within a pair of shapes in the collection. During inference, based on the estimated PEVs, we can produce a confidence score to distinguish whether the established correspondence is valid or not.

In summary, contributions of this work include: 1) We propose a novel paradigm leveraging implicit functions for category-specific unsupervised dense 3D shape correspondence, which is suitable for objects with diverse variations including varying topology. 2) We devise several effective loss functions to learn a semantic part embedding, which enables both shape segmentation and dense correspondence. Further, based on the learnt part embedding, our method can estimate a confidence score measuring whether the predicted correspondence is valid or not. 3) Through extensive experiments, we demonstrate the superiority of our method in shape segmentation and 3D semantic correspondence.

Figure 1: Given a set of 3D shapes, our category-specific unsupervised method learns pair-wise dense correspondence (a) between any source and target shape (red box), and shape segmentation (b). Given an arbitrary point on the source shape (red box), our method predicts its corresponding point on any target shape, and a score measuring the correspondence confidence (c). For each target, we show the confidence scores of the red/green points, and score maps around corresponded points. A score below a chosen threshold deems the correspondence as "non-existing", a desirable property for topology-varying shapes with missing parts, e.g., a chair's arm.

2 Related Work

Dense Shape Correspondence While there are many dense correspondence works for organic shapes Ovsjanikov et al. (2012); Litany et al. (2017); Groueix et al. (2018a); Halimi et al. (2019); Roufosse et al. (2019); Lee and Kazhdan (2019); Boscaini et al. (2016); Steinke et al. (2007), due to space constraints, our review focuses on methods designed for man-made objects, including optimization and learning-based methods. For the former, most prior works build correspondences only at a part level Kalogerakis et al. (2010); Huang et al. (2011); Sidi et al. (2011); Alhashim et al. (2015); Zhu et al. (2017). Kim et al. (2012) propose a diffusion map to compute point-based "fuzzy correspondence" for every shape pair. This is only effective for a small collection of shapes with limited shape variations. Kim et al. (2013) and Huang et al. (2015) present a template-based deformation method, which can find point-level correspondences after rigid alignment between the template and target shapes. However, these methods only predict coarse and discrete correspondence, leaving the structural or topological discrepancies between matched parts or part ensembles unresolved. A series of learning-based methods Yi et al. (2017); Huang et al. (2017); Sung et al. (2018); Muralikrishnan et al. (2019); You et al. (2020) are proposed to learn local descriptors, and treat correspondence as 3D semantic landmark estimation. E.g., ShapeUnicode Muralikrishnan et al. (2019) learns a unified embedding for 3D shapes and demonstrates its ability in correspondence among 3D shapes. However, these methods require ground-truth pairwise correspondences for training. Recently, Chen et al. (2020) present an unsupervised method to estimate 3D structure points. Unfortunately, it estimates a constant number of sparse structured points. As shapes may have diverse part constitutions, it may not be meaningful to establish the correspondence between all of their points. Groueix et al. (2019) also learn a parametric transformation between two surfaces by leveraging cycle-consistency, and apply it to the segmentation problem. However, the deformation-based method always deforms all points on one shape to another, even the points from a non-matching part. In contrast, our unsupervisedly learnt model can perform pairwise dense correspondence for any two shapes of a man-made object.

Implicit Shape Representation Due to the advantages of continuous representation and handling complicated topologies, implicit functions have been adopted for learning representations for 3D shape generation Chen and Zhang (2019); Mescheder et al. (2019); Park et al. (2019); Liu et al. (2019b), encoding texture Oechsle et al. (2019); Sitzmann et al. (2019); Saito et al. (2019), and 4D reconstruction Niemeyer et al. (2019). Meanwhile, some works Huang et al. (2004, 2006) leverage the implicit representation together with a deformation model for shape registration. However, these methods rely on the deformation model, which might prevent their usage for topology-varying objects. Slavcheva et al. (2017) present an approach which implicitly obtains correspondence for organic shapes by predicting the evolution of the signed distance field. However, as it requires a Laplacian operator to be invariant, it is limited to small shape variations. Recently, some extensions have been proposed to learn deep structured Genova et al. (2019, 2020) or segmented implicit functions Chen et al. (2019), or separate implicit functions for shape parts Paschalidou et al. (2020). However, instead of working at a part level, we extend implicit functions for unsupervised dense shape correspondence.

3 Proposed Method

Let us first formulate the dense 3D correspondence problem. Given a collection of 3D shapes of the same object category, one may encode each shape in a latent space. For any point in the source shape, dense 3D correspondence will find its semantic corresponding point in the target shape if a semantic embedding function (SEF) maps the two points, each conditioned on the code of its own shape, to embeddings whose distance is below a small threshold (Eqn. 1).

Here the SEF is responsible for mapping a point from its 3D Euclidean space to the semantic embedding space. When the source point and a target point have sufficiently similar locations in the semantic embedding space, they have similar semantic meaning, or functionality, in their respective shapes. Hence the target point is the corresponding point of the source point. On the other hand, if their distance in the embedding space exceeds the threshold, the source point has no corresponding point in the target shape. If the SEF could be learned for a small threshold, the corresponded point could be solved by applying the inverse of the SEF, which maps a point from the semantic embedding space back to the 3D space, to the embedding of the source point. Therefore, the dense correspondence amounts to learning the SEF and its inverse function.

Toward this goal, we propose to leverage the topology-free implicit function, a conventional shape representation, to jointly serve as SEF. By assuming that corresponding points are similar in the embedding space, we explicitly implement an inverse function mapping from the embedding space to the 3D space, so that the learning objectives can be more conveniently defined in the 3D space rather than the embedding space. Both functions are jointly learned with an occupancy loss for accurate shape representation, and a self-reconstruction loss for the inverse function to recover the input shape itself. In addition, we propose a cross-reconstruction loss enforcing two objectives. One is that the two functions can deform source shape points to be sufficiently close to the target shape. The other is that the offset vectors between corresponding points are locally smooth within the neighbourhood of each source point.
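The correspondence criterion described above can be written compactly. The block below is only a sketch under assumed notation, since none of these symbols survives in the text: p and q denote a source and a target point, S_B the target shape, z_A and z_B the two shape codes, f the SEF, f^{-1} its inverse, and ε the threshold on the embedding distance. The use of the Euclidean norm is likewise an assumption.

```latex
% Assumed notation (not taken from the source text):
%   p: source point, q: target point, S_B: target shape,
%   z_A, z_B: shape codes, f: SEF, \epsilon: distance threshold.
\min_{q \in S_B} \big\| f(p, z_A) - f(q, z_B) \big\|_2 < \epsilon,
\qquad
q = f^{-1}\big( f(p, z_A),\, z_B \big)
```

Under this reading, a query point whose best embedding distance is not below ε is declared to have no correspondence on the target shape.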

3.1 Implicit Function and Its Inverse

Implicit Function As in Chen and Zhang (2019); Mescheder et al. (2019), a shape is first encoded as a shape code by a PointNet Qi et al. (2017a). Given the 3D coordinate of a query point, the implicit function assigns an occupancy probability between 0 and 1, where 1 indicates that the point is inside the shape, and 0 outside. This conventional function cannot serve as SEF, given its simple 1D output. Motivated by the unsupervised part segmentation Chen et al. (2019), we adopt its branched layer as the final layer of our implicit function, whose output is the vector shown in Fig. 2. A max-pooling operator leads to the final occupancy by selecting one branch, whose index indicates the unsupervisedly estimated part to which the point belongs. Conceptually, each element of the output vector shall indicate the occupancy value of the point w.r.t. the respective part. Since the output vector appears to represent the occupancy of the point w.r.t. all semantic parts of the object, its latent space can be the desirable semantic embedding, and thus we term it the part embedding vector (PEV) of the point. In our implementation, the implicit function is composed of fully connected layers, each followed by a LeakyReLU, except the final output (Sigmoid).
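The max-pooling step above can be illustrated with a minimal sketch. The function name `occupancy_and_part`, the example values, and the 0.5 inside/outside cut-off are assumptions made for illustration; they are not taken from the paper.

```python
def occupancy_and_part(pev):
    """Max-pool a part embedding vector (PEV).

    Returns (occupancy, part_index): the largest branch value and the
    index of the branch that produced it.
    """
    if not pev:
        raise ValueError("PEV must contain at least one branch value")
    part_index = max(range(len(pev)), key=lambda i: pev[i])
    return pev[part_index], part_index


# Hypothetical 4-branch PEV for one query point (values are made up).
pev = [0.05, 0.91, 0.10, 0.02]
occupancy, part = occupancy_and_part(pev)
is_inside = occupancy > 0.5  # 0.5 is an assumed decision threshold
```

Here the second branch wins the max-pooling, so the point would be treated as occupied and assigned to the part with index 1.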

Figure 2: Model Overview. (a) Given a shape, PointNet is used to extract the shape feature code. Then a part embedding is produced via a deep implicit function. We implement dense correspondence through an inverse function mapping from the part embedding to recover the 3D shape. (b) To further make the learned part embedding consistent across all the shapes, we randomly select two shapes. By swapping their part embedding vectors, a cross-reconstruction loss is used to enforce the inverse function to recover each shape from the other shape's embedding.

Inverse Implicit Function Given the objective function in Eqn. 1, one may consider that learning the SEF would be sufficient for dense correspondence. However, this has two issues. 1) To find the correspondence of a source point, we need to compute the inverse of the SEF, i.e., assume that the output of the SEF on the target shape equals the PEV of the source point and solve for the target point via iterative back-propagation. This can be inefficient during inference. 2) It is easier to define shape-related constraints or losses between points in the 3D space than between their PEVs in the embedding space. To this end, we define the inverse implicit function to take the PEV and the shape code as inputs, and recover the corresponding 3D location. We use a multilayer perceptron (MLP) network to implement the inverse function. With it, we can efficiently compute the corresponding point via forward passing, without iterative back-propagation.

3.2 Training with Loss Functions

We jointly train our implicit function and inverse function by minimizing three losses: the occupancy loss, the self-reconstruction loss, and the cross-reconstruction loss (Eqn. 2). The occupancy loss measures how accurately the implicit function predicts the occupancy of the shapes, the self-reconstruction loss enforces that the inverse function is indeed an inverse of the implicit function, and the cross-reconstruction loss strives for part embedding consistency across all shapes in the collection. We first explain how we prepare the training data, then detail our losses.

Training Samples Given a collection of raw 3D surfaces with consistent upright orientation, we first normalize the raw surfaces by uniformly scaling each object such that the diagonal of its tight bounding box has a constant length, and make the surfaces watertight by converting them to voxels. Following the sampling scheme of Chen and Zhang (2019), for each shape, we obtain spatial points and their occupancy labels, which are 1 for the inside points and 0 otherwise. In addition, we uniformly sample surface points to represent the 3D shapes.

Occupancy Loss This is the error between the label and the estimated occupancy, accumulated over the sampled points of all shapes (Eqn. 3).

Self-Reconstruction Loss We supervise the inverse function by recovering the input surface points (Eqn. 4): given the PEV of each surface vertex and the code of its shape, the inverse function should return that vertex.

Cross-Reconstruction Loss The cross-reconstruction loss is designed to encourage the resultant PEVs to be similar for densely corresponded points from any two shapes. As in Fig. 2, from a shape collection we first randomly select two shapes. The implicit function generates their PEVs given the two point sets and their respective shape codes as inputs. Then we swap their PEVs and send the concatenated vectors to the inverse function, so that the PEVs of each shape are paired with the shape code of the other. If the part embedding is point-to-point consistent across all shapes, the inverse function should recover each shape from the other shape's PEVs. Towards this goal, we exploit several loss functions to minimize the pairwise difference between those shapes.


The cross-reconstruction loss is a weighted sum of four terms: the Chamfer distance (CD) loss, the Earth Mover distance (EMD) loss, the surface normal loss, and the smooth correspondence loss. The first three terms focus on the shape similarity, while the last one encourages the correspondence offsets to be locally smooth.

Chamfer distance loss compares each cross-reconstructed shape with the shape it should recover, where CD is calculated as in Qi et al. (2017a).

Earth mover distance loss is defined on the same pairs of shapes, where EMD is the minimum of the sum of distances between a point in one set and a point in another set over all possible permutations of correspondences Qi et al. (2017a), i.e., over all bijective mappings between the two sets.

Surface normal loss An appealing property of implicit representation is that the surface normal can be analytically computed using the spatial derivative via back-propagation through the network. Hence, we are able to define the surface normal distance on the point sets. We measure the distance between two surface normals by the Cosine similarity distance, computed from their dot-product.

Smooth correspondence loss encourages the correspondence offset vectors of neighboring points to be as similar as possible to ensure a smooth deformation, where the neighborhoods are taken around each point of the two shapes.
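As an illustration of the shape-similarity terms, the following is a minimal sketch of one common variant of the Chamfer distance between two point sets. The exact form used in the paper is not recoverable from the text, so the choice of squared Euclidean distances and per-direction averaging is an assumption, as are the function names and the toy point sets.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def chamfer_distance(set_a, set_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    For every point, take the squared distance to its nearest neighbour
    in the other set, average per direction, and add both directions.
    """
    if not set_a or not set_b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(min(squared_distance(a, b) for b in set_b) for a in set_a) / len(set_a)
    b_to_a = sum(min(squared_distance(a, b) for a in set_a) for b in set_b) / len(set_b)
    return a_to_b + b_to_a


# Two toy "shapes": the second is the first shifted by one unit along z.
shape_a = [(0, 0, 0), (1, 0, 0)]
shape_b = [(0, 0, 1), (1, 0, 1)]
print(chamfer_distance(shape_a, shape_b))
```

This brute-force version costs one distance evaluation per pair of points, so it is only meant for small sets; a training loop would normally rely on a batched GPU implementation.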

3.3 Inference

During inference, our method can offer both shape segmentation and dense correspondence for 3D shapes. As each element of the PEV learns a compact representation for one common part of the shape collection, the shape segmentation of a point is the index of the element being max-pooled from its PEV. As both the implicit function and its inverse are point-based, the number of input points can be arbitrary during inference. Given two point sets with their shape codes, the implicit function generates their PEVs, and the inverse function outputs the cross-reconstructed shape. For any query point on the source shape, a preliminary correspondence may be found by a nearest neighbour search in the cross-reconstructed shape. Knowing the index of this preliminary match in the cross-reconstructed shape, the same index in the target shape refers to the final correspondence. Here, the nearest neighbour search might not be optimal as it limits the solution to the already sampled points in the target shape. An alternative is that, once the preliminary correspondence is found, we can search within its neighbourhood for a surface point that is an even closer match. As our input shapes are densely sampled, this alternative does not provide notable benefits, and thus we use the first approach. Finally, we compute the correspondence confidence from the distance between the PEVs of the query point and its correspondence, where the distance is normalized to the range of [0, 1]. Since the learned part embedding is discriminative among different parts of a shape, the distance of PEVs is suitable to define the confidence. When the confidence is larger than a pre-defined threshold, this correspondence is valid; otherwise the query point has no correspondence.
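The inference procedure can be sketched as a nearest neighbour search followed by a confidence test. This is a sketch under stated assumptions, not the authors' implementation: the confidence is taken to be one minus the PEV distance scaled by a normalizer `max_pev_distance`, and all names and toy values below are hypothetical. The list `cross_recon` is assumed to be index-aligned with `target_points`, i.e., entry i was produced from target point i.

```python
import math


def nearest_index(query, points):
    """Index of the point closest to `query` (brute-force search)."""
    return min(range(len(points)), key=lambda i: math.dist(query, points[i]))


def correspond(query, query_pev, cross_recon, target_points, target_pevs,
               max_pev_distance, threshold):
    """Return (target_point_or_None, confidence) for one source query point."""
    idx = nearest_index(query, cross_recon)
    # Assumed confidence: one minus the PEV distance scaled into [0, 1].
    distance = math.dist(query_pev, target_pevs[idx])
    confidence = 1.0 - min(distance / max_pev_distance, 1.0)
    if confidence > threshold:
        return target_points[idx], confidence
    return None, confidence


# Toy data: two target points, their cross-reconstructions, and 2-dim PEVs.
target_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cross_recon = [(0.1, 0.0, 0.0), (0.9, 0.0, 0.0)]
target_pevs = [(1.0, 0.0), (0.0, 1.0)]

match, conf = correspond((0.8, 0.0, 0.0), (0.0, 1.0), cross_recon,
                         target_points, target_pevs,
                         max_pev_distance=math.sqrt(2.0), threshold=0.2)
no_match, low_conf = correspond((0.8, 0.0, 0.0), (1.0, 0.0), cross_recon,
                                target_points, target_pevs,
                                max_pev_distance=math.sqrt(2.0), threshold=0.2)
```

In the first call the query's PEV agrees with that of the matched target point, so the match is accepted; in the second call the PEVs disagree, the confidence falls below the threshold, and the query is reported as having no correspondence.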

3.4 Implementation Detail

Our method is trained in three stages: 1) The PointNet encoder and the implicit function are trained on sampled point-value pairs via Eqn. 3. 2) The encoder, the implicit function, and the inverse function are jointly trained via Eqn. 3 and 4. 3) We jointly train the encoder, the implicit function, and the inverse function with the overall loss. We implement our model in PyTorch and use the Adam optimizer with a fixed learning rate in all stages.

Figure 3: Correspondence accuracy for the categories in the BHCP benchmark. The dashed lines indicate that the methods are rotation-invariant and evaluated in the unaligned setting. All baseline results are quoted from Chen et al. (2020); Kim et al. (2013).
Figure 4: (a) Dense correspondences across categories. Each row shows one target shape (red box) and its pair-wise corresponded source shapes. Given a spatially colored target shape, the correspondence enables us to assign each point on a source shape the color of its corresponding point, or red if the corresponding point does not exist. (b) For one pair, the non-existent correspondences are impacted by the confidence threshold (varied from top to bottom).

4 Experiments

4.1 3D Semantic Correspondence

Data We evaluate on 3D semantic point correspondence, a special case of dense correspondence, with two motivations: 1) no database of man-made objects has ground-truth dense correspondence; 2) there is far less prior work in dense correspondence for man-made objects than in the semantic correspondence task, which has strong baselines for comparison. Thus, to evaluate semantic correspondence, we train on ShapeNet Chang et al. (2015) and test on BHCP Kim et al. (2013) following the setting of Huang et al. (2017); Chen et al. (2020). For training, we use a subset of ShapeNet including the plane, bike, and chair categories to train individual models. For testing, BHCP provides ground-truth semantic points for shapes in four categories: plane, bike, chair, and helicopter. We generate all pairs of shapes within each category for testing. The helicopter is tested with the plane model, as Huang et al. (2017); Chen et al. (2020) did. As BHCP shapes come with rotations, prior works test on either one or both settings: aligned and unaligned (i.e., no vs. arbitrary relative rotation between the two shapes).

Baseline We compare our work with multiple state-of-the-art (SOTA) baselines. Kim12 Kim et al. (2012) and Kim13 Kim et al. (2013) are traditional optimization methods that require part labels for templates and employ collection-wise co-analysis. LMVCNN Huang et al. (2017), ShapeUnicode Muralikrishnan et al. (2019), AtlasNet2 Deprelle et al. (2019) and Chen et al. (2020) are all learning based, where Huang et al. (2017); Muralikrishnan et al. (2019) require ground-truth correspondence labels for training. Although Chen et al. (2020) only estimate a fixed number of sparse points, both Chen et al. (2020) and ours are trained without labels. As optimization-based methods and Huang et al. (2017) are designed for the unaligned setting, we also train a rotation-invariant version of ours by supervising the encoder to predict an additional rotation matrix and applying it to rotate the input points before feeding them to the implicit function.

Results The correspondence accuracy is measured by the fraction of correspondences whose error is below a given threshold of Euclidean distance. As in Fig. 3, the solid lines show the results on the aligned data and the dashed lines on the unaligned data. We can clearly observe that our method outperforms baselines in the plane, bike and chair categories on aligned data. Note that Kim13 Kim et al. (2013) has a slightly higher accuracy than ours on the helicopter category, likely due to the fact that Kim et al. (2013) test with the helicopter-specific model, while we test on the unseen helicopter category with a plane-specific model. At a fixed distance threshold, our method also improves the average accuracy over Chen et al. (2020). For unaligned data, our method achieves performance competitive with the baselines. While it has the best AUC overall, it is worse within a certain range of thresholds. The main reason is that the implicit network itself is sensitive to rotation. Note that this comparison shall be viewed in the context that most baselines use extra cues during training or inference, as well as the high inference speed of our learning-based approach. Some visual dense correspondence results are shown in Fig. 4. Note that the amount of non-existent correspondence is impacted by the confidence threshold, as in Fig. 4. A larger threshold discovers more subtle non-existent correspondences. This is expected as the division between semantically corresponded or not can be blurred for some shape parts. By only finding the closest points on aligned 3D shapes, we report the resulting semantic correspondence accuracy as the black curve in Fig. 6. Clearly, our accuracy is much higher than this "lower bound", indicating our method doesn't rely much on the canonical orientation. To further validate on noisy real data, we evaluate on the chair category with additive noise and compare with Chen et al. (2020). As shown in Fig. 6, the accuracy is slightly worse than testing on clean data. However, our method still outperforms the baseline on noisy data.

Detecting Non-Existence of Correspondences Our method can build dense correspondences for 3D shapes with different topologies, and automatically declare the non-existence of correspondence. The experiment in Fig. 3 cannot fully depict this capability of our algorithm, as no semantic point was annotated on a non-matching part. Also, there is no benchmark providing the non-existence label between a shape pair. We thus build a dataset of paired shapes from the chair category of the ShapeNet part dataset. Within a pair, one shape has the arm part while the other does not. For the former, we annotate arm points and non-arm points based on the provided part labels. As correspondences don't exist for the arm points, we can utilize this data to measure our detection of non-existence of correspondence. Based on our confidence scores, we report the ROC in Fig. 6. The AUC shows our strong capability in detecting no correspondence.
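The accuracy metric used in this section, the fraction of correspondences whose Euclidean error is below a threshold, can be sketched as follows. The function name and the toy coordinates are illustrative assumptions; sweeping `threshold` over a range would trace out one curve of the kind plotted in Fig. 3.

```python
import math


def correspondence_accuracy(predicted, ground_truth, threshold):
    """Fraction of predicted points lying within `threshold` of the truth."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, g in zip(predicted, ground_truth)
               if math.dist(p, g) < threshold)
    return hits / len(predicted)


# Three hypothetical semantic points with errors 0.01, 0.5, and 0.0.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.01), (1.0, 0.5, 0.0), (0.0, 2.0, 0.0)]

tight = correspondence_accuracy(predicted, ground_truth, 0.05)
loose = correspondence_accuracy(predicted, ground_truth, 1.0)
```

With the tight threshold only two of the three predictions count as correct, while the loose threshold accepts all of them.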

4.2 Unsupervised Shape Segmentation

In testing, unlike prior template-based Kim et al. (2013) or feature point estimation methods Chen et al. (2020), we don't need to transfer any segmentation labels. Thus, we only compare with the SOTA unsupervised segmentation method BAE-Net Chen et al. (2019). Following the same protocol Chen et al. (2019), we train category-specific models and test on the same categories of the ShapeNet part dataset Yi et al. (2016): plane, bag, cap, chair, mug, skateboard, table, and chair* (a joint chair+table set). Intersection over Union (IoU) between the prediction and the ground-truth is a common metric for segmentation. Since unsupervised segmentation is not guaranteed to produce exactly the same part counts as the ground-truth, e.g., it may combine the seat and back of a chair as one part, we report a modified IoU Chen et al. (2019) measuring against both parts and part combinations in the ground-truth. As in Tab. 1, our model achieves a consistently higher segmentation accuracy than BAE-Net for all categories. As BAE-Net is very similar to our model trained in Stage 1, these results show that our dense correspondence task helps the PEV to better segment the shapes into parts, thus producing a more semantically meaningful embedding. Some visual results of segmentation are shown in Fig. 5.
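To make the idea of scoring against part combinations concrete, here is a minimal sketch of an IoU between one predicted part and a set of ground-truth parts. It does not reproduce the exact modified IoU of Chen et al. (2019); the function name, the label encoding, and the toy labels are assumptions.

```python
def part_iou(predicted_labels, truth_labels, predicted_part, truth_parts):
    """IoU between one predicted part and a combination of true parts."""
    pred = {i for i, label in enumerate(predicted_labels)
            if label == predicted_part}
    truth = {i for i, label in enumerate(truth_labels)
             if label in truth_parts}
    union = pred | truth
    # Two empty point sets are treated as a perfect match.
    return len(pred & truth) / len(union) if union else 1.0


# Six points: the model merges seat and back into its part 0.
predicted_labels = [0, 0, 0, 1, 1, 1]
truth_labels = ["seat", "seat", "back", "leg", "leg", "arm"]

merged = part_iou(predicted_labels, truth_labels, 0, {"seat", "back"})
seat_only = part_iou(predicted_labels, truth_labels, 0, {"seat"})
```

Scoring predicted part 0 against the combined seat+back region gives a perfect overlap, whereas scoring it against the seat alone is penalized for the extra back point.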

Shape (#parts) plane () bag () cap () chair () chair* () mug () skateboard () table () Aver.
leg, arm
back, seat,
leg, arm
BAE-Net Chen et al. (2019)
Proposed 88.0
Table 1: Unsupervised segmentation on ShapeNet part. We use #parts in evaluation and = for all models.
Figure 5: Qualitative results of our unsupervised segmentation in Tab. 1: shapes in each of the categories.

4.3 Ablations and Visualizations

Shape Representation Power of Implicit Function We hope our novel implicit function still serves as a shape representation while achieving dense correspondence. Hence its shape representation power needs to be evaluated. Following the setting of Tab. 1, we first pass a ground-truth point set from the test set to the encoder and extract the shape code. By feeding the shape code and a grid of points to the implicit function, we can reconstruct the 3D shape by Marching Cubes. We evaluate how well the reconstruction matches the ground-truth point set. Averaged over the categories, the Chamfer distance (CD) of our reconstructions is lower than that of the branched implicit function (BAE-Net). The lower CD shows that our novel design of semantic embedding actually improves the shape representation.

Figure 6: (a) Additional semantic correspondence results for the chair category in BHCP. (b) ROC curve of non-existence of correspondence detection. (c) Shape segmentation and (d) 3D semantic correspondence performances on the chair category over different dimensionalities of PEV.

Loss Terms on Correspondence Since the point occupancy loss and self-reconstruction loss are essential, we only ablate each term in the cross-reconstruction loss for the chair category. Correspondence results in Fig. 7 demonstrate that, while all loss terms contribute to the final performance, one of the shape-similarity terms and the smooth correspondence term are the most crucial ones. The former forces the cross-reconstructed shape to resemble its target. Without the latter, it is possible that the cross-reconstructed shape resembles the target well, but with erroneous correspondences locally.

Part Embedding over Training Stages The assumption of learned PEVs being similar for corresponding points motivates our algorithm design. To validate this assumption, we visualize the PEVs of the semantic points defined in Fig. 7, together with their ground-truth corresponding points across chairs. The t-SNE visualizes the PEVs in a 2D plot with one color per semantic point, after each training stage. The model after Stage 1 training resembles BAE-Net. As in Fig. 7, the points corresponding to the same semantic point, i.e., 2D points of the same color, scatter and overlap with other semantic (colored) points. With the inverse function and self-reconstruction loss in Stage 2, the part embedding shows more promising grouping of colored points. Finally, the part embedding after Stage 3 has well clustered and more discriminative grouping, which means points corresponding to the same semantic location do have similar PEVs. The improvement trend of part embedding across stages shows the effectiveness of our loss design and training scheme.

One-hot vs. Continuous Embedding Ideally, BAE-Net Chen et al. (2019) should output a one-hot vector before max-pooling, which would benefit unsupervised segmentation the most. In contrast, our PEVs prefer a continuous embedding rather than one-hot. To better understand PEV, we compute the statistics of Cosine Similarity (CS) between the PEVs and their corresponding one-hot vectors, for both BAE-Net and ours. This shows our learnt PEVs are approximately one-hot vectors. Compared to BAE-Net, our smaller CS and larger variance are likely due to the limited network capability, as well as our encouragement to learn a continuous embedding benefiting correspondence.
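The one-hot comparison can be sketched as follows, assuming that the "corresponding one-hot vector" of a PEV is the one-hot vector at its max-pooled index; that reading, the function name, and the example vectors are assumptions for illustration.

```python
import math


def cosine_similarity_to_one_hot(pev):
    """Cosine similarity between a PEV and the one-hot vector at its argmax.

    The dot product with that one-hot vector is simply the largest entry,
    and the one-hot vector has unit norm, so only the PEV norm is needed.
    """
    norm = math.sqrt(sum(v * v for v in pev))
    if norm == 0.0:
        raise ValueError("PEV must not be the zero vector")
    return max(pev) / norm


exact = cosine_similarity_to_one_hot([0.0, 1.0, 0.0])
spread = cosine_similarity_to_one_hot([0.5, 0.5])
```

A PEV that is already one-hot scores exactly 1, while a PEV that spreads its mass evenly over two branches scores about 0.71, so lower values indicate a more continuous embedding.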

Dimensionality of PEV Fig. 6 (c) and (d) show the shape segmentation and semantic correspondence results over the dimensionality of PEV. Our algorithm performs the best in both at the same dimensionality. Although chairs are unsupervisedly segmented into only a few parts, the extra dimensions of PEV benefit the finer-grained task of correspondence (Fig. 7), which in turn helps segmentation.

Figure 7: (a) 3D semantic correspondence reflecting the contribution of our loss terms. (b) Semantic points overlaid with the shape. (c) The t-SNE of the estimated PEVs over training stages. Points of the same color are the PEVs of ground-truth corresponding points across chairs, and the colors refer to the points in (b). In one setting, only the elements of PEVs max-pooled for chair part segmentation are fed to t-SNE; the other uses extra elements.

Computation Time Our training on one category is run to convergence on a single GTX-series Ti GPU, with the training time split across Stages 1, 2, and 3. In inference, the average runtime to pair two shapes includes the runtimes of the encoder, implicit function, and inverse function networks, the neighbour search, and the confidence calculation.

5 Conclusion

In this work, we propose a novel framework including an implicit function and its inverse for dense 3D shape correspondences of topology-varying objects. Based on the learnt semantic part embedding via our implicit function, dense correspondence is established via the inverse function mapping from the part embedding to the corresponding 3D point. In addition, our algorithm can automatically calculate a confidence score measuring the probability of correspondence, which is desirable for man-made objects with large topological variations. The comprehensive experimental results show the superiority of the proposed method in unsupervised shape correspondence and segmentation.

Broader Impact

Product design (e.g., furniture) is labor intensive and requires expertise in computer graphics. With the increasing number and diversity of 3D CAD models in online repositories, there is a growing need to leverage them to facilitate future product development due to their similarities in function and shape. Towards this goal, our proposed method provides a novel unsupervised paradigm to establish dense correspondence for topology-varying objects, which is a prerequisite for shape analysis and synthesis. Furthermore, as our approach is designed for generic objects, its application space can be extremely wide.


The authors would like to thank the reviewers and area chairs for their valuable comments and suggestions. We acknowledge Vladimir G. Kim and Nenglun Chen for sharing data and results.


In this supplementary material, we provide:

  • Implementation details, including network structures and training details.
  • Additional experimental results, including expressiveness of the inverse implicit function and visualization of the correspondence confidence score.

A Implementation Details

A.1 Network Structures

PointNet Encoder .

To extract the shape code, we adopt a PointNet-like Qi et al. (2017a) architecture as our encoder. The detailed architecture of is depicted in Fig. 8(a). The encoder takes a point set as input and generates a -dim shape latent code .

Implicit Function .

The implicit function network follows the work of Chen et al. (2019) (unsupervised case). The implicit function takes the shape code and a spatial point as inputs and predicts the part embedding vector (PEV) . As shown in Fig. 8(b), it is composed of fully connected (FC) layers each of which is applied with , except the final output is applied a activation.

Inverse Implicit Function .

The inverse implicit function is also implemented as an MLP, which is composed of FC layers each of which is applied with , except the final output is applied a activation. As shown in Fig. 8(c), the inverse implicit function network takes the PEVs and the shape latent code as inputs and recovers the corresponding 3D points.

(a) PointNet-based encoder
(b) Implicit function
(c) Inverse implicit function
Figure 8: Network Architectures. (a) The PointNet-based encoder network. A shape code is predicted from the input point set. denotes the max-pooling operator. (b) The implicit function network is composed of fully connected layers, denoted as “FC”. The shape code is concatenated, denoted as “+”, with the xyz query, making a -dim vector, and is provided as input to the first layer. The activation is applied to the first FC layers while the part embedding vector is obtained with a activation. Finally, a max-pooling operator gives the final occupancy value . (c) The inverse implicit function network is also implemented as an MLP, which is composed of FC layers. Specifically, it takes PEV and the shape code as inputs, and recovers the corresponding 3D location.
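The data flow through the three networks in Fig. 8 can be sketched as below. This is a toy forward pass with random weights, not the trained model: the layer counts and dimensions (`CODE_DIM`, `PEV_DIM`, `HIDDEN`) are small placeholder values, and the choice of leaky ReLU for hidden layers and sigmoid for every output layer is an assumption, since the activation names and sizes are not recoverable from the text above.

```python
import math
import random

def linear(x, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def leaky_relu(x, slope=0.01):
    return [v if v > 0 else slope * v for v in x]

def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]

def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def mlp(x, layers):
    """Hidden layers use leaky ReLU, the output layer uses sigmoid (assumed)."""
    for weights, bias in layers[:-1]:
        x = leaky_relu(linear(x, weights, bias))
    weights, bias = layers[-1]
    return sigmoid(linear(x, weights, bias))

def encode(point_set, layers):
    """PointNet-style encoder: shared per-point MLP, then max-pooling."""
    features = [mlp(list(p), layers) for p in point_set]
    return [max(column) for column in zip(*features)]

def implicit_function(shape_code, xyz, layers):
    """Returns (PEV, occupancy); the occupancy is the max-pooled PEV."""
    pev = mlp(shape_code + list(xyz), layers)
    return pev, max(pev)

def inverse_implicit_function(pev, shape_code, layers):
    """Maps a PEV concatenated with the shape code back to a 3D location."""
    return mlp(pev + shape_code, layers)

rng = random.Random(0)
CODE_DIM, PEV_DIM, HIDDEN = 8, 4, 16  # toy sizes, not the paper's
enc_layers = [init_layer(3, HIDDEN, rng), init_layer(HIDDEN, CODE_DIM, rng)]
f_layers = [init_layer(CODE_DIM + 3, HIDDEN, rng), init_layer(HIDDEN, PEV_DIM, rng)]
g_layers = [init_layer(PEV_DIM + CODE_DIM, HIDDEN, rng), init_layer(HIDDEN, 3, rng)]

points = [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6), (0.7, 0.8, 0.9)]
z = encode(points, enc_layers)
pev, occupancy = implicit_function(z, points[0], f_layers)
xyz = inverse_implicit_function(pev, z, g_layers)
```

The max-pooling in `encode` makes the shape code independent of the order of the input points, and the max-pooling in `implicit_function` ties the occupancy value to the most active PEV element.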

A.2 Training Details

Sampling Point-Value Pairs.

The training of the implicit function network needs point-value pairs. Following the sampling strategy of Chen and Zhang (2019), we obtain the paired data offline. are the spatial point and the corresponding occupancy label. We sample points from the voxel models at different resolutions: (), () and () in order to train the implicit function progressively.
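As an illustration of what a point-value pair looks like, the sketch below draws pairs from a binary voxel grid. Uniform sampling of voxel centres is a simplification chosen for brevity and does not reproduce the cited sampling strategy; the grid, the coordinate normalisation, and the function name are hypothetical.

```python
import random

def sample_point_value_pairs(voxels, n_samples, rng):
    """Draw (point, occupancy) pairs from a cubic binary voxel grid.

    voxels[i][j][k] is 0 or 1. Points are voxel centres normalised to
    (-0.5, 0.5)^3. Uniform sampling is an assumption made for brevity.
    """
    res = len(voxels)
    pairs = []
    for _ in range(n_samples):
        i, j, k = (rng.randrange(res) for _ in range(3))
        point = tuple((idx + 0.5) / res - 0.5 for idx in (i, j, k))
        pairs.append((point, voxels[i][j][k]))
    return pairs

# Toy 4^3 grid whose lower half (k < 2) is occupied.
res = 4
grid = [[[1 if k < 2 else 0 for k in range(res)] for _ in range(res)] for _ in range(res)]
pairs = sample_point_value_pairs(grid, 32, random.Random(0))
```

Running the same function on grids of increasing resolution would give the coarse-to-fine data needed for progressive training.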

Training Process

We summarize the training process in Tab. 2. In Stage , we adopt a progressive training technique Chen and Zhang (2019) to train our implicit function on gradually increasing resolution data (), which stabilizes and significantly speeds up the training process.

Network Loss
Stage ,
Stage , , and
Stage , ,
Table 2: Stages of the training process.

B Additional Experimental Results

A supplementary video is provided to visualize additional results, explained as follows.

B.1 Expressiveness of Inverse Implicit Function

Given our inverse implicit function, we can cross-reconstruct two paired shapes from each other by swapping their part embedding vectors. Further, we can interpolate shapes both in the shape latent space and in 3D space while maintaining the point-level correspondence consistently.

Cross-Reconstruction Performance.

We first show the cross-reconstruction performances in the supplementary video. From a shape collection, we can randomly select two shapes and . Their shape codes and can be predicted by the PointNet encoder. With their respectively generated PEVs and , we can swap their PEVs and send the concatenated vectors to the inverse function and obtain , . As shown in the video, the cross reconstructions closely resemble each other, even with different part constitutions. Here, we also provide the cross-reconstruction performance of two additional object categories: car and table.
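The PEV swap described above can be sketched as follows. `toy_inverse` is a hypothetical stand-in for the trained inverse function (it simply scales the PEV by the last entry of the concatenated vector), and the order of concatenation (PEV first, then shape code) is an assumption.

```python
def cross_reconstruct(pevs_a, pevs_b, z_a, z_b, inverse_fn):
    """Swap PEVs between two shapes and decode each with the other's shape code."""
    a_from_b = [inverse_fn(list(pev) + list(z_a)) for pev in pevs_b]
    b_from_a = [inverse_fn(list(pev) + list(z_b)) for pev in pevs_a]
    return a_from_b, b_from_a

def toy_inverse(vector):
    """Hypothetical inverse function: first three entries are the PEV,
    the last entry (the toy shape code) acts as a scale."""
    scale = vector[-1]
    return tuple(scale * v for v in vector[:3])

pevs_a, z_a = [(0.1, 0.2, 0.3)], [1.0]
pevs_b, z_b = [(0.4, 0.5, 0.6), (0.0, 1.0, 0.0)], [2.0]
a_from_b, b_from_a = cross_reconstruct(pevs_a, pevs_b, z_a, z_b, toy_inverse)
```

Each output list keeps the ordering of the PEVs it was decoded from, so the i-th reconstructed point is by construction the correspondence of the i-th point on the other shape.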

Interpolation in Latent Space.

An alternative way to explore the correspondence ability of the inverse implicit function is to evaluate its interpolation capability. In this experiment, we first interpolate shapes in the latent space (), and send the concatenated vectors ( and ) to the inverse function. As observed in the video, our inverse implicit function generalizes well to different shape deformations. Moreover, the correspondences are point-to-point consistent across all the deformations. This also demonstrates that the learned part embedding is discriminative among different parts of a shape and point-wise consistent among different shapes.
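A minimal sketch of this experiment follows, assuming the shape codes are blended linearly and the PEVs of one shape are held fixed while the code changes. The linear blend, the step schedule, and the `toy_inverse` stand-in are assumptions for illustration.

```python
def lerp(z_a, z_b, alpha):
    """Linear blend of two shape codes; alpha = 0 gives z_a, alpha = 1 gives z_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]

def interpolate_shapes(pevs, z_a, z_b, inverse_fn, n_steps):
    """Decode a fixed set of PEVs under a sequence of interpolated shape codes."""
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    frames = []
    for step in range(n_steps):
        z = lerp(z_a, z_b, step / (n_steps - 1))
        frames.append([inverse_fn(list(pev) + z) for pev in pevs])
    return frames

def toy_inverse(vector):
    """Hypothetical inverse function: scales the PEV by the last entry."""
    scale = vector[-1]
    return tuple(scale * v for v in vector[:3])

frames = interpolate_shapes([(0.5, 0.0, 0.25)], [1.0], [3.0], toy_inverse, n_steps=3)
```

Because every frame decodes the same PEVs in the same order, the i-th point of each frame traces one corresponding point through the deformation.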

Latent Interpolation Comparison.

We compare our latent interpolation capability with that of a conventional implicit function. For the conventional implicit function, we sample a grid of points and pass them to the implicit function to obtain its value. With the threshold of , we obtain the surface points. As can be observed in the video, the interpolation performance of our inverse implicit function is better than that of the conventional implicit function in shape generation and deformation. Furthermore, our interpolations maintain point-to-point correspondence across all the deformations.
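The grid-and-threshold procedure used for the conventional implicit function can be sketched as below. The threshold value of 0.5, the grid extent, and the `toy_sphere` occupancy function are assumptions; the sketch keeps every grid point whose value exceeds the threshold, which is a simple stand-in for surface extraction.

```python
import math

def occupied_grid_points(implicit_fn, resolution, threshold=0.5):
    """Evaluate implicit_fn on a regular grid over [-0.5, 0.5]^3 and keep
    the points whose value exceeds the threshold (0.5 is an assumed value)."""
    coords = [-0.5 + i / (resolution - 1) for i in range(resolution)]
    return [
        (x, y, z)
        for x in coords for y in coords for z in coords
        if implicit_fn((x, y, z)) > threshold
    ]

def toy_sphere(point, radius=0.3):
    """Hypothetical occupancy function: 1 inside a sphere, 0 outside."""
    return 1.0 if math.dist(point, (0.0, 0.0, 0.0)) < radius else 0.0

points = occupied_grid_points(toy_sphere, resolution=5)
```

Points obtained this way are ordered by grid position rather than by semantics, so two interpolation frames extracted like this carry no point-to-point correspondence, in contrast to decoding a fixed set of PEVs with the inverse function.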

Interpolation in 3D Space.

We also show the interpolation capability of the corresponding points in the 3D space in the video. Given the estimated dense correspondence, we can compute the correspondence offset vectors for all corresponding pairs of points. Assuming we interpolate the correspondence in video frames, for each frame we move all points of by the amount of and show the moved points. It can be observed that our deformed shape is a meaningful semantic blending of the two shapes. In addition, the correspondence offsets are locally smooth in the 3D space.
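The offset-based interpolation can be sketched as follows, assuming a linear schedule in which frame i moves each source point by i/N of its offset vector. The schedule, the function names, and the toy points are illustrative assumptions.

```python
def correspondence_offsets(source_points, target_points):
    """Offset vector from each source point to its corresponding target point."""
    return [
        tuple(t - s for s, t in zip(p, q))
        for p, q in zip(source_points, target_points)
    ]

def interpolate_frames(source_points, target_points, n_frames):
    """Frames 0..n_frames; frame i moves every point by i / n_frames of its offset."""
    deltas = correspondence_offsets(source_points, target_points)
    frames = []
    for i in range(n_frames + 1):
        t = i / n_frames
        frames.append([
            tuple(c + t * d for c, d in zip(p, delta))
            for p, delta in zip(source_points, deltas)
        ])
    return frames

source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 2.0, 0.0), (1.0, 1.0, 1.0)]  # target[i] corresponds to source[i]
frames = interpolate_frames(source, target, n_frames=4)
```

The first frame reproduces the source points, the last frame reaches the corresponding target points, and intermediate frames blend the two shapes point by point.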

B.2 Visualization of the Correspondence Confidence Score

To further visualize the correspondence confidence score, we provide the confidence score maps for some examples in Figure of the paper. As shown in the video, the confidence score indicates the probability of correspondence around corresponded points between the target shape (red box) and its pair-wise source shapes. For example, for the source shapes with arms, we can clearly see that the confidence scores of the arm part are significantly lower than those of other parts.


  • P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. Guibas (2018) Learning representations and generative models for 3D point clouds. In ICML, Cited by: §1.
  • I. Alhashim, K. Xu, Y. Zhuang, J. Cao, P. Simari, and H. Zhang (2015) Deformation-driven topology-varying 3D shape correspondence. TOG. Cited by: §1, §2.
  • M. Atzmon, N. Haim, L. Yariv, O. Israelov, H. Maron, and Y. Lipman (2019) Controlling neural level sets. In NeurIPS, Cited by: §1.
  • V. Blanz and T. Vetter (2003) Face recognition based on fitting a 3D morphable model. TPAMI. Cited by: §1.
  • F. Bogo, J. Romero, M. Loper, and M. J. Black (2014) FAUST: dataset and evaluation for 3D mesh registration. In CVPR, Cited by: §1.
  • D. Boscaini, J. Masci, E. Rodolà, and M. Bronstein (2016) Learning shape correspondence with anisotropic convolutional neural networks. In NeurIPS, Cited by: §2.
  • A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu (2015) ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012. Cited by: §4.1.
  • N. Chen, L. Liu, Z. Cui, R. Chen, D. Ceylan, C. Tu, and W. Wang (2020) Unsupervised learning of intrinsic structural representation points. In CVPR, Cited by: §1, §2, Figure 3, §4.1, §4.2.
  • Z. Chen, K. Yin, M. Fisher, S. Chaudhuri, and H. Zhang (2019) BAE-NET: branched autoencoder for shape co-segmentation. In ICCV, Cited by: §A.1, §1, §2, §3.1, §4.2, §4.3, Table 1.
  • Z. Chen and H. Zhang (2019) Learning implicit fields for generative shape modeling. In CVPR, Cited by: §A.2, §A.2, §1, §2, §3.1, §3.2.
  • T. Deprelle, T. Groueix, M. Fisher, V. Kim, B. Russell, and M. Aubry (2019) Learning elementary structures for 3D shape generation and matching. In NeurIPS, Cited by: §4.1.
  • K. Genova, F. Cole, A. Sud, A. Sarna, and T. Funkhouser (2020) Local deep implicit functions for 3D shape. In CVPR, Cited by: §2.
  • K. Genova, F. Cole, D. Vlasic, A. Sarna, W. T. Freeman, and T. Funkhouser (2019) Learning shape templates with structured implicit functions. In ICCV, Cited by: §2.
  • G. Gkioxari, J. Malik, and J. Johnson (2019) Mesh R-CNN. In ICCV, Cited by: §1.
  • T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry (2018a) 3D-CODED: 3D correspondences by deep deformation. In ECCV, Cited by: §1, §2.
  • T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry (2018b) AtlasNet: a papier-mâché approach to learning 3D surface generation. In CVPR, Cited by: §1.
  • T. Groueix, M. Fisher, V. G. Kim, B. C. Russell, and M. Aubry (2019) Unsupervised cycle-consistent deformation for shape matching. In Computer Graphics Forum, Cited by: §2.
  • O. Halimi, O. Litany, E. Rodola, A. M. Bronstein, and R. Kimmel (2019) Unsupervised learning of dense shape correspondence. In CVPR, Cited by: §1, §2.
  • H. Huang, E. Kalogerakis, S. Chaudhuri, D. Ceylan, V. G. Kim, and E. Yumer (2017) Learning local shape descriptors from part correspondences with multiview convolutional networks. TOG. Cited by: §1, §2, §4.1.
  • H. Huang, E. Kalogerakis, and B. Marlin (2015) Analysis and synthesis of 3D shape families via deep-learned generative models of surfaces. In Computer Graphics Forum, Cited by: §2.
  • Q. Huang, V. Koltun, and L. Guibas (2011) Joint shape segmentation with linear programming. In SIGGRAPH Asia, Cited by: §2.
  • Q. Huang, F. Wang, and L. Guibas (2014) Functional map networks for analyzing and exploring large shape collections. TOG. Cited by: §1.
  • X. Huang, N. Paragios, and D. N. Metaxas (2006) Shape registration in implicit spaces using information theory and free form deformations. TPAMI. Cited by: §2.
  • X. Huang, S. Zhang, Y. Wang, D. Metaxas, and D. Samaras (2004) A hierarchical framework for high resolution facial expression tracking. In CVPRW, Cited by: §2.
  • E. Kalogerakis, A. Hertzmann, and K. Singh (2010) Learning 3D mesh segmentation and labeling. In SIGGRAPH, Cited by: §2.
  • V. G. Kim, W. Li, N. J. Mitra, S. Chaudhuri, S. DiVerdi, and T. Funkhouser (2013) Learning part-based templates from large collections of 3D shapes. TOG. Cited by: §2, Figure 3, §4.1, §4.2.
  • V. G. Kim, W. Li, N. J. Mitra, S. DiVerdi, and T. Funkhouser (2012) Exploring collections of 3D models using fuzzy correspondences. TOG. Cited by: §1, §2, §4.1.
  • V. Kraevoy, A. Sheffer, and C. Gotsman (2003) Matchmaker: constructing constrained texture maps. TOG. Cited by: §1.
  • S. C. Lee and M. Kazhdan (2019) Dense point-to-point correspondences between genus-zero shapes. In Computer Graphics Forum, Cited by: §1, §2.
  • O. Litany, T. Remez, E. Rodolà, A. Bronstein, and M. Bronstein (2017) Deep functional maps: structured prediction for dense shape correspondence. In ICCV, Cited by: §1, §2.
  • F. Liu, L. Tran, and X. Liu (2019a) 3D face modeling from diverse raw scan data. In ICCV, Cited by: §1.
  • S. Liu, S. Saito, W. Chen, and H. Li (2019b) Learning to infer implicit surfaces without 3D supervision. In NeurIPS, Cited by: §1, §2.
  • L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3D reconstruction in function space. In CVPR, Cited by: §1, §2, §3.1.
  • S. Muralikrishnan, V. G. Kim, M. Fisher, and S. Chaudhuri (2019) Shape unicode: A unified shape representation. In CVPR, Cited by: §2, §4.1.
  • M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger (2019) Occupancy flow: 4D reconstruction by learning particle dynamics. In ICCV, Cited by: §1, §2.
  • M. Oechsle, L. Mescheder, M. Niemeyer, T. Strauss, and A. Geiger (2019) Texture fields: learning texture representations in function space. In ICCV, Cited by: §2.
  • M. Ovsjanikov, M. Ben-Chen, J. Solomon, A. Butscher, and L. Guibas (2012) Functional maps: a flexible representation of maps between shapes. TOG. Cited by: §1, §2.
  • J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019) DeepSDF: learning continuous signed distance functions for shape representation. In CVPR, Cited by: §1, §2.
  • D. Paschalidou, L. van Gool, and A. Geiger (2020) Learning unsupervised hierarchical part decomposition of 3D objects from a single rgb image. In CVPR, Cited by: §2.
  • C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017a) Pointnet: deep learning on point sets for 3D classification and segmentation. In CVPR, Cited by: §A.1, §1, §3.1, §3.2.
  • C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017b) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS, Cited by: §1.
  • J. Roufosse, A. Sharma, and M. Ovsjanikov (2019) Unsupervised deep learning for structured shape matching. In ICCV, Cited by: §1, §2.
  • S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li (2019) PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, Cited by: §1, §2.
  • O. Sidi, O. van Kaick, Y. Kleiman, H. Zhang, and D. Cohen-Or (2011) Unsupervised co-segmentation of a set of shapes via descriptor-space spectral clustering. In SIGGRAPH Asia, Cited by: §1, §2.
  • V. Sitzmann, M. Zollhöfer, and G. Wetzstein (2019) Scene representation networks: continuous 3D-structure-aware neural scene representations. In NeurIPS, Cited by: §2.
  • M. Slavcheva, M. Baust, and S. Ilic (2017) Towards implicit correspondence in signed distance field evolution. In ICCV, Cited by: §2.
  • J. Solomon, A. Nguyen, A. Butscher, M. Ben-Chen, and L. Guibas (2012) Soft maps between surfaces. In Computer Graphics Forum, Cited by: §1.
  • F. Steinke, V. Blanz, and B. Schölkopf (2007) Learning dense 3D correspondence. In NeurIPS, Cited by: §1, §2.
  • M. Sung, H. Su, R. Yu, and L. J. Guibas (2018) Deep functional dictionaries: learning consistent semantic structures on 3D models from functions. In NeurIPS, Cited by: §2.
  • O. Van Kaick, H. Zhang, G. Hamarneh, and D. Cohen-Or (2011) A survey on shape correspondence. In Computer Graphics Forum, Cited by: §1.
  • N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. Jiang (2018) Pixel2mesh: generating 3D mesh models from single RGB images. In ECCV, Cited by: §1.
  • L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas (2016) A scalable active framework for region annotation in 3D shape collections. TOG. Cited by: §4.2.
  • L. Yi, H. Su, X. Guo, and L. J. Guibas (2017) SyncspecCNN: synchronized spectral CNN for 3D shape segmentation. In CVPR, Cited by: §2.
  • Y. You, Y. Lou, C. Li, Z. Cheng, L. Li, L. Ma, C. Lu, and W. Wang (2020) KeypointNet: a large-scale 3D keypoint dataset aggregated from numerous human annotations. In CVPR, Cited by: §2.
  • C. Zhu, R. Yi, W. Lira, I. Alhashim, K. Xu, and H. Zhang (2017) Deformation-driven shape correspondence via shape recognition. TOG. Cited by: §2.
  • S. Zuffi, A. Kanazawa, D. W. Jacobs, and M. J. Black (2017) 3D menagerie: modeling the 3D shape and pose of animals. In CVPR, Cited by: §1.