
Unsupervised cycle-consistent deformation for shape matching

by   Thibault Groueix, et al.

We propose a self-supervised approach to deep surface deformation. Given a pair of shapes, our algorithm directly predicts a parametric transformation from one shape to the other that respects correspondences. Our insight is to use cycle-consistency to define a notion of good correspondences in groups of objects and to use it as a supervisory signal to train our network. Our method does not rely on a template, does not assume near-isometric deformations, and does not require point-correspondence supervision. We demonstrate the efficacy of our approach by using it to transfer segmentation across shapes. We show, on ShapeNet, that our approach is competitive with comparable state-of-the-art methods when annotated training data is readily available, but outperforms them by a large margin in the few-shot segmentation scenario.



1 Introduction

Figure 1: Shape deformation with cycle-consistency. Our approach takes a pair of point clouds as input and predicts a deformation of A into B. During training, a cycle-consistent loss on a shape triplet (A, B, C) allows the method to learn semantically consistent deformations between the three shapes without any priors. Red arrows represent the learned shape deformation function and green arrows indicate the projection of the deformed shape onto the nearest point on the surface of the target shape.

Large collections of 3D models enable data-driven techniques for interactive geometry modeling, shape synthesis, image-based reconstruction, and shape completion [MWZ14]. Many of these techniques require the collection to have additional surface annotations such as segmentation into functional [YKC16] or geometric [LSD18] parts. The notion of parts and their granularity can vary significantly across different tasks, so many novel applications require new types of annotations [MZC19, YLZ19, WZS19]. Deep learning algorithms have recently achieved state-of-the-art results in automatically predicting such surface annotations [QSMG16, QYSG17, WSL18]. However, they typically require a significant number of training examples for every shape category, which limits their applicability and imposes a significant start-up cost when introducing a new type of annotation. In this work, we propose a new deep learning approach which leverages large non-annotated object collections to perform few-shot segmentation.

We build on the idea of using shape matching to transfer labels from similar examples. This approach has been shown to be robust in extreme “few-shot” learning scenarios [YKC16] and can work robustly even in heterogeneous datasets as long as labeled models roughly span all the shape variations. The few-shot segmentation problem then amounts to the fundamental problem of identifying correspondences between shapes. There is a vast amount of work on shape matching, which can be roughly separated into two trends: (i) classical optimization-based approaches; (ii) recent approaches where correspondences are directly predicted by a neural network.

Traditional optimization-based methods, such as the iterative closest point (ICP) algorithm, are fast and effective with good initial guesses and few degrees of freedom (e.g., a rigid motion) [RL01]. More flexible correspondence algorithms for dissimilar models usually require significantly more compute time to optimize for a larger number of degrees of freedom [BR07, KLF11, CK15]. Since directly matching dissimilar shapes poses significant challenges, these methods often rely on joint analysis of the entire collection [KLM12], leveraging cycle consistency priors during optimization [HG13, NBCW11]. These joint correspondence estimation methods tend to be computationally heavy, and as new models are added to the collection, the entire optimization needs to be repeated. We thus turn to deep learning-based approaches.

Indeed, with the recent advances in neural networks for geometry analysis, learning-based methods have been proposed to address the matching problem. Of particular interest to us is the method of Groueix et al. [GFK18a], who demonstrate that one can learn how to deform a human body template to the target point cloud, even without correspondence supervision. In their approach, the target point cloud is encoded into a latent descriptor space (via a PointNet encoder [QSMG16]), and the deformation network then takes the target descriptor and a point on the template, and maps the point to a new position so that it aligns with the target. This approach is efficient, since it only requires a forward pass through a network. It also has the benefit of a holistic understanding of shape deformations, since the same neural network is trained for all models in the input collection. However, it has to be trained specifically for each template, limiting this method to the analysis of geometrically and topologically similar shape collections, such as human bodies. If such a template is not available, one can pick a very generic shape (e.g., a sphere) and still obtain some correspondences via the intermediate domain [GFK18b]. However, as we will show, the quality of the correspondences degrades significantly as shapes deviate from that domain.

In this work we propose a novel neural network architecture that learns to match shapes directly, without relying on a pre-defined template, by learning to predict deformations that align points on the source shape to points on the target. Note that the transformation can be much more complex than a rigid transformation, and that the space of meaningful transformations is defined implicitly by the (unlabelled) training data. We encode both source and target shapes and then predict the deformed position for every point on the source conditioned on these two codes, unlike prior work that uses a fixed template common to all the shapes. We show that the results obtained can be greatly improved if the network is trained not only with a reconstruction loss, which encourages it to deform the source shape into the target shape, but also with a cycle consistency loss. Indeed, a deformation which respects correspondences should be consistent between pairs of shapes, i.e., the deformation from A to B should be the inverse of the deformation from B to A. More generally, in a larger cycle of N shapes, global consistency is achieved if the composition of the N successive mappings around the cycle is the identity. This new consistency loss used during training can be seen as playing a role similar to the global consistency objective used in optimization-based approaches. Finally, our network is trained in a self-supervised manner using only shape reconstruction and cycle consistency losses.

We demonstrate the effectiveness of our approach for shape matching by propagating segmentations in a few-shot learning setting on the ShapeNet part dataset [YKC16]. We first show that in this extreme case with very few training examples, PointNet [QSMG16], a strongly supervised method, fails to generalize. Then, we propose several strategies for picking source shapes and propagating the signal from them using our predicted correspondences. We demonstrate that even with a simple strategy, such as picking the source with the smallest Chamfer distance, our method is better at transferring segmentations than other fast correspondence techniques such as ICP with rigid transformation and a prior learning-based method that aligns sphere and plane templates [GFK18b].

(a) Parameter prediction network.
(b) Deformation network.
Figure 2: Shape deformation approach. Our method takes as input a pair (source A, target B) of shapes and aims at predicting the deformation of A into B. In (a), A and B are encoded with PointNets [QSMG16] into a latent feature vector, from which an MLP predicts transformation parameters, used in (b) to deform A into B by stacking Transformation Layers (TL) and Fully-Connected Layers (FC).

2 Related Work

Shape matching is a long-standing problem in shape analysis [vKZHCO11]. It is often done explicitly, by deforming a source shape to a target [RL01, BR07, LSP08, HAWG08, ZSCO08], or implicitly, by mapping points [KLF11, CK15, OMMG10, BBK06] or functions [OBCS12, RPWO18, EBC17] on one shape to another. The deformation-based methods typically aim to minimize the amount of distortion introduced by the deformation, and the mapping-based approaches often assume that shapes are near-isometric. Neither assumption holds for very dissimilar shapes.

To address this challenge, some prior methods leverage the additional context of the entire shape collection in a joint optimization [KLM12, NBCW11]. These techniques often use cycle-consistency as an additional cue [HZG12, HG13, ROA13]. This enables estimating correspondences even between dissimilar objects by mapping via intermediate shapes. While these traditional optimization techniques are very powerful, non-rigid matching involves optimizing for many degrees of freedom with complex non-convex objective functions, and takes minutes or hours. To make matters worse, joint analysis usually scales super-linearly with the number of models, and if a new shape is added to a collection, the entire optimization needs to be repeated.

Recently, learning-based correspondence techniques were used to address these limitations. They are fast, typically only requiring a forward pass through a neural network, and they enable joint analysis of a collection of shapes, since multiple shapes are typically used during training. Descriptor-based methods embed each shape point into some high-dimensional space, where corresponding points are embedded nearby [HKC18, BMRB16, WHC16]. In most cases, however, a more holistic mapping for the entire shape is preferred, since it is more capable of preserving the intrinsic shape structure. Litany et al. [LRR17] use a deep neural network to predict a soft inter-surface mapping, a common representation used in the functional map framework. Groueix et al. [GFK18a] propose to train a network that predicts a deformation for each point on a template. A similar method that uses planes or spheres can be used in case such a template is not available [GFK18b]. These techniques struggle with diverse shape collections when matched shapes have very different topology and geometry. Instead, we propose a method that takes both source and target shape as input and infers the mapping. We also propose a novel regularization term favoring cycle-consistency when mapping across multiple shapes in the collection. A similar cycle-consistency loss for training deep networks to predict correspondences between images of different instances of objects from the same category has recently been used in [ZKA16]. In that work, views of a 3D model rendered from different viewpoints were used to avoid the trivial identity flow solution, but no correspondence between 3D shapes was predicted.

We demonstrate the value of our method for few-shot segmentation transfer. While many techniques have been developed for strongly supervised mesh segmentation [QSMG16, QYSG17, WSL18, LSD18, KAMC17, KHS10], they typically rely on many training examples and fail in few-shot scenarios (see Table 1). In these cases, some frameworks propose to propagate annotations from the most similar annotated shapes via global or local shape matching [YKC16]. In fact, it is common for correspondence techniques to be evaluated and used for transferring various signals between shapes [OBCS12, KLF11, ACBCO17, CFG15].

3 Learning asymmetric cycle-consistent shape matching

We address the surface matching problem by training a model that takes as inputs a source shape, a target shape, and a point on the source shape and generates the corresponding point on the target shape. As pointed out in Groueix et al. [GFK18a], a learnable model allows for efficient surface matching, which is in contrast to approaches requiring optimization over a collection of pairwise shape matches [NBCW11].

We assume that shapes are represented as point sets sampled from the shapes’ surface. Given point sets $A, B \subset \mathbb{R}^3$, our goal is to learn a mapping function $f_{A,B}$ that takes a 3D point $p \in A$ to its corresponding point $q \in B$. If $f$ is a function on points and $A$ a set of points, we denote by $f(A)$ the set $\{f(p),\ p \in A\}$.

First, building on work on unsupervised template-based shape correspondence [GFK18a], we use a Chamfer loss to minimize the distance between the deformed source and the target. Unlike prior work, however, we do not assume that all of our shapes are derived from the same template, and we directly predict template-free correspondences between pairs of shapes.

Second, we seek to leverage the success of cycle consistency, which has been used in shape collection optimization [NBCW11] and more recently in self-supervised learning [ZPIE17], during training of our learnable mapping function. Formally, for shapes $X_1, \dots, X_n$ that are assumed to be put into correspondence, and writing $f_{X,Y}$ for the learnable mapping function from shape $X$ to shape $Y$, we enforce that

$$f_{X_n,X_1} \circ f_{X_{n-1},X_n} \circ \cdots \circ f_{X_1,X_2}(p) = p, \quad \forall p \in X_1. \tag{1}$$

We use cycle-consistency training losses for cycles of lengths two and three, as this implies consistency for cycles of any length [NBCW11]. We visualize our cycle-consistency loss in Figure 1.

4 Approach

We describe our learnable mapping function, implemented as a two-stage neural network, in Section 4.1, our training losses in Section 4.2, and the application to segmentation in Section 4.3.

4.1 Architecture

The architecture of our shape transformation model from a source shape A to a target shape B is visualized in Figure 2 and can be separated into two parts: (a) a parameter prediction network which outputs transformation parameters given the two shapes (Figure 2a); (b) a deformation network that transforms the first shape into the second one using the predicted parameters (Figure 2b). We now describe these two components.

To predict transformation parameters, A and B are first passed into two independent PointNet networks [QSMG16], leading to two feature encodings, one per shape, each of size 512. The resulting concatenated descriptor contains information about the pair (A, B). A multilayer perceptron (MLP) then predicts the transformation parameter vectors from this concatenated feature.

The deformation network (Figure 2b) takes a surface point in $A$ and outputs the associated deformed point. The network is composed of $N$ modules, each with the same architecture. Let $x_i$ be the input of module $i$ and $x_{i+1}$ its output. The operation computed by this module is:

$$x_{i+1} = \sigma_i\big(a_i \odot (W_i x_i) + b_i\big), \tag{2}$$

where $W_i$ is the matrix of parameters of a fully-connected layer, "$\odot$" refers to the Hadamard (term-by-term) product, $\sigma_i$ is the activation function for module $i$, and $a_i$ and $b_i$ are the transformation parameters, both with one entry per output dimension of the layer, corresponding to a scale and a bias in each dimension. Note that this is similar to the architecture of the T-net modules in [QSMG16, JSZ15], but using fewer predicted parameters. Also note that equation 2 is differentiable, which enables the two sub-networks to be trained jointly in an end-to-end fashion. In all of our experiments we used the same number of modules, 64 dimensions for each intermediate feature, and ReLU activations for all but the last layer, for which we used a hyperbolic tangent.
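As an illustration of the module described above (a fully-connected layer, followed by a per-dimension scale and bias, followed by an activation), here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the three-module depth, and the random stand-ins for the MLP-predicted scale and bias vectors are all assumptions made for the example.

```python
import numpy as np

def deformation_module(x, W, a, b, activation):
    """One module: activation(a * (W x) + b), applied to a batch of points.

    x: (n, c_in) inputs, W: (c_out, c_in) layer weights,
    a, b: (c_out,) per-dimension scale and bias (predicted by the MLP in the paper).
    """
    return activation(a * (x @ W.T) + b)

def deform(points, weights, scales, biases):
    """Stack modules: ReLU for all but the last one, which uses tanh."""
    relu = lambda z: np.maximum(z, 0.0)
    x = points
    last = len(weights) - 1
    for i, (W, a, b) in enumerate(zip(weights, scales, biases)):
        x = deformation_module(x, W, a, b, np.tanh if i == last else relu)
    return x

rng = np.random.default_rng(0)
dims = [3, 64, 64, 3]  # hypothetical depth; 64-d intermediate features as in the text
weights = [rng.normal(size=(dims[i + 1], dims[i])) for i in range(3)]
scales = [rng.normal(size=dims[i + 1]) for i in range(3)]  # stand-ins for predicted scales
biases = [rng.normal(size=dims[i + 1]) for i in range(3)]  # stand-ins for predicted biases
deformed = deform(rng.normal(size=(100, 3)), weights, scales, biases)
```

Because the scale and bias enter through element-wise operations only, gradients flow from the deformed points back to whatever network produced them, which is what allows joint training of the two sub-networks.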

We train for 500 epochs with Adam [KB14], starting from an initial learning rate that is divided by 10 after 400 epochs.

4.2 Training Losses

We train our deformation network by minimizing a weighted sum of several components: a loss enforcing cycle consistency $\mathcal{L}_{\text{cycle}}$, a Chamfer distance loss $\mathcal{L}_{\text{CD}}$, and a self-reconstruction loss $\mathcal{L}_{\text{SR}}$:

$$\mathcal{L} = \lambda_{\text{cycle}}\,\mathcal{L}_{\text{cycle}} + \lambda_{\text{CD}}\,\mathcal{L}_{\text{CD}} + \lambda_{\text{SR}}\,\mathcal{L}_{\text{SR}}. \tag{3}$$

We only use the self-reconstruction loss to stabilize the beginning of the training and disable it after 30 epochs to focus on the cycle consistency and reconstruction losses. We train all parameters in our network by sampling triplets of shapes, which are needed by our 3-cycle consistency, and enforcing all other losses on all the associated deformations. We first explain how we sample these triplets, then detail the different terms of our loss.
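The schedule described above can be sketched as follows. The weight values are placeholders (the text does not state them here), and the function name and the zero-based epoch convention are assumptions of this example.

```python
def total_loss(cycle, chamfer, self_rec, epoch,
               w_cycle=1.0, w_chamfer=1.0, w_self_rec=1.0, self_rec_epochs=30):
    """Weighted sum of the three loss terms.

    The self-reconstruction term is only active during the first
    `self_rec_epochs` epochs (epochs are counted from 0 here).
    All weights are hypothetical placeholders.
    """
    if epoch >= self_rec_epochs:
        w_self_rec = 0.0
    return w_cycle * cycle + w_chamfer * chamfer + w_self_rec * self_rec

early = total_loss(1.0, 2.0, 4.0, epoch=0)   # all three terms contribute
late = total_loss(1.0, 2.0, 4.0, epoch=30)   # self-reconstruction is disabled
```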

4.2.1 Training shape sampling

For our cycle-consistency loss, we require a valid mapping across a shape triplet (A, B, C). As different shape categories may have different topologies, we train category-specific networks. Furthermore, as there may be topological changes within a single category, for a shape A we randomly sample shapes B and C from the K nearest neighbors of A under the Chamfer distance, for a fixed K. We demonstrate in the ablation study the superiority of this approach over random sampling of shape triplets.
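A minimal sketch of this sampling step is given below, assuming a symmetric Chamfer distance for retrieval and assuming that B and C are drawn without replacement; neither detail is stated in the text, and the function names and the toy data are hypothetical.

```python
import numpy as np

def chamfer_distance(X, Y):
    """Symmetric Chamfer distance between point sets X (n, 3) and Y (m, 3)."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def sample_triplet(shapes, anchor_idx, k, rng):
    """Return indices (A, B, C), with B and C among the k nearest neighbors of A."""
    dists = np.array([chamfer_distance(shapes[anchor_idx], s) for s in shapes])
    dists[anchor_idx] = np.inf                 # never pick A as its own neighbor
    neighbors = np.argsort(dists)[:k]
    b_idx, c_idx = rng.choice(neighbors, size=2, replace=False)
    return anchor_idx, int(b_idx), int(c_idx)

rng = np.random.default_rng(0)
# Toy collection: six clouds shifted further and further away from the first one.
shapes = [rng.normal(size=(50, 3)) + 5.0 * i for i in range(6)]
a_idx, b_idx, c_idx = sample_triplet(shapes, anchor_idx=0, k=3, rng=rng)
```

In practice the pairwise distances would be precomputed once per category rather than recomputed for every sampled triplet.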

We apply data augmentation to each sampled shape, in this order: a random rotation about a fixed axis by an angle drawn from a bounded range, an anisotropic scaling with factors drawn from a bounded range, a bounding box normalization, and a small random translation below 0.03.
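The augmentation order can be sketched as below. The rotation axis, the angle range, and the scale range are placeholders chosen for illustration only, since their values are not available in the text; the per-axis interpretation of the 0.03 translation bound and the normalization convention (centered box, longest side of length 1) are likewise assumptions.

```python
import numpy as np

def augment(points, rng, max_angle=np.pi, scale_range=(0.75, 1.25), max_shift=0.03):
    """Rotation, anisotropic scaling, bounding-box normalization, small translation.

    max_angle and scale_range are hypothetical values, not the paper's settings.
    """
    theta = rng.uniform(-max_angle, max_angle)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])  # rotation about y (assumed axis)
    pts = points @ R.T
    pts = pts * rng.uniform(*scale_range, size=3)               # one scale factor per axis
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    pts = (pts - (lo + hi) / 2.0) / (hi - lo).max()             # center box, longest side = 1
    return pts + rng.uniform(-max_shift, max_shift, size=3)     # per-axis shift below max_shift

rng = np.random.default_rng(0)
augmented = augment(rng.normal(size=(200, 3)), rng)
```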

4.2.2 Cycle-consistency loss

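The cycle consistency loss is based on the intuition that a point deformed through any cycle of deformations should be mapped back to itself. One way to enforce consistency would be to compute composite functions: for two shapes $A$ and $B$, with $f_{A,B}$ denoting the learned deformation from $A$ to $B$, one would minimize the distance between $f_{B,A}(f_{A,B}(p))$ and $p$ for all $p$ in $A$. However, $f_{A,B}(p)$ is typically not an element of $B$, and computing $f_{B,A}(f_{A,B}(p))$ would thus require computing the deformations of points other than the points of $B$. To avoid this, we instead consider projections of the deformed shapes onto the target shapes. More precisely, we define the shape projection operator

$$\pi_B(p) = \arg\min_{q \in B} \|p - q\|$$

and enforce 2-cycle consistency between $A$ and $B$ by minimizing

$$\mathcal{L}_{\text{cycle2}}(A, B) = \frac{1}{|A|} \sum_{p \in A} \big\| f_{B,A}\big(\pi_B(f_{A,B}(p))\big) - p \big\| \tag{4}$$

and 3-cycle consistency for the cycle $A \to B \to C \to A$ by minimizing

$$\mathcal{L}_{\text{cycle3}}(A, B, C) = \frac{1}{|A|} \sum_{p \in A} \big\| f_{C,A}\big(\pi_C(f_{B,C}(\pi_B(f_{A,B}(p))))\big) - p \big\|. \tag{5}$$

Our full cycle-consistency loss is simply defined by summing over all possible 2-cycles and 3-cycles of a sampled triplet (A, B, C):

$$\mathcal{L}_{\text{cycle}}(A, B, C) = \sum_{(X, Y)} \mathcal{L}_{\text{cycle2}}(X, Y) + \sum_{(X, Y, Z)} \mathcal{L}_{\text{cycle3}}(X, Y, Z), \tag{6}$$

where the sums run over ordered pairs and ordered triples of distinct shapes in $\{A, B, C\}$.

Enforcing 2- and 3-cycle consistency implies consistency for any cycle [NBCW11].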
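The projected cycle loss can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' code: the learned deformation is abstracted as a callable `f(src, tgt, pts)` (a hypothetical interface), distances are averaged over points, and the toy `translate` function stands in for a trained network.

```python
import numpy as np
from itertools import permutations

def project(points, target):
    """Snap each point to its nearest neighbor in the target point set."""
    d = np.linalg.norm(points[:, None, :] - target[None, :, :], axis=-1)
    return target[d.argmin(axis=1)]

def cycle_loss(shapes, f):
    """Average distance between points of shapes[0] and their image after one full cycle.

    f(src, tgt, pts) deforms pts (lying on src) toward tgt. The deformed points are
    projected onto every intermediate shape before the next deformation is applied.
    """
    start, pts, n = shapes[0], shapes[0], len(shapes)
    for i in range(n):
        src, tgt = shapes[i], shapes[(i + 1) % n]
        pts = f(src, tgt, pts)
        if i < n - 1:
            pts = project(pts, tgt)
    return np.linalg.norm(pts - start, axis=1).mean()

def full_cycle_loss(A, B, C, f):
    """Sum of the losses over all ordered 2-cycles and 3-cycles of the triplet."""
    two = sum(cycle_loss(list(p), f) for p in permutations([A, B, C], 2))
    three = sum(cycle_loss(list(p), f) for p in permutations([A, B, C], 3))
    return two + three

def translate(src, tgt, pts):
    """Toy deformation: shift by the difference of centroids."""
    return pts + (tgt.mean(axis=0) - src.mean(axis=0))

rng = np.random.default_rng(0)
A = rng.normal(size=(64, 3))
loss = full_cycle_loss(A, A + 1.0, A - 2.0, translate)  # shapes differ by pure translations
```

For shapes that are exact translates of each other, the toy deformation is perfectly cycle-consistent and the loss vanishes up to floating-point error.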

4.2.3 Reconstruction loss

As discussed in Section 3, we want to enforce that every point in the target shape is well reconstructed, but not necessarily that every point in the source shape is mapped to the target shape, in case some parts appear in the source and not in the target. We thus use an asymmetric Chamfer distance to quantify how well the network has generated the target shape. More precisely, given a pair of shapes $(X, Y)$ and the learned deformation $f_{X,Y}$ from $X$ to $Y$, the asymmetric Chamfer distance computes the average distance between a point $q \in Y$ and its nearest neighbor in $f_{X,Y}(X)$:

$$\mathcal{L}_{\text{CD}}(X, Y) = \frac{1}{|Y|} \sum_{q \in Y} \min_{p \in X} \big\| f_{X,Y}(p) - q \big\|. \tag{7}$$

Given a training triplet $(A, B, C)$, we define the reconstruction loss by summing the asymmetric Chamfer loss over all 6 possible (source, target) couples:

$$\mathcal{L}_{\text{CD}}(A, B, C) = \sum_{\substack{X, Y \in \{A, B, C\} \\ X \neq Y}} \mathcal{L}_{\text{CD}}(X, Y). \tag{8}$$

If segmentation is available for the training shapes, we can compute the distance in equation 7 on each segment independently, which adds supervision on the correspondences. We of course do not use such labels for our few-shot learning experiments, but show in Table 2 that they can be used, if available, to slightly boost our results.
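The asymmetry of this loss is easy to see in code. The sketch below is a hedged illustration: the unsquared Euclidean distance and the `f(src, tgt, pts)` interface for the learned deformation are assumptions of the example.

```python
import numpy as np
from itertools import permutations

def asymmetric_chamfer(deformed_source, target):
    """Average distance from each target point to its nearest deformed source point."""
    d = np.linalg.norm(target[:, None, :] - deformed_source[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def reconstruction_loss(A, B, C, f):
    """Sum of the asymmetric loss over the 6 ordered (source, target) pairs.

    f(src, tgt, pts) is a stand-in for the learned deformation of src toward tgt.
    """
    return sum(asymmetric_chamfer(f(X, Y, X), Y) for X, Y in permutations([A, B, C], 2))

# The source has an extra part (the far-away point) that the target lacks.
source = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
target = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
covered = asymmetric_chamfer(source, target)   # every target point is reconstructed
reverse = asymmetric_chamfer(target, source)   # the extra part would be penalized here
```

Only the first direction is used, so a source part with no counterpart in the target does not contribute to the loss.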

4.2.4 Self-reconstruction loss

We can fully supervise the deformation by manually deforming a shape with a known transformation. We found such supervision helpful to stabilize and speed up the beginning of our training. Concretely, we sampled deformations similar to those used for data augmentation (described above in 4.2.1) by composing (1) a rotation, (2) an anisotropic scaling, and (3) a rescaling to a centered bounding box. Given a shape $A$ and a transformation $T$, we compute the average distance between the two images of a point under $T$ and under the predicted mapping function $f_{A,T(A)}$:

$$\mathcal{L}_{\text{SR}}(A, T) = \frac{1}{|A|} \sum_{p \in A} \big\| f_{A,T(A)}(p) - T(p) \big\|. \tag{9}$$

Our corresponding self-reconstruction loss is the sum of this loss for each of the three point clouds in the triplet (A, B, C) with different random transformations $T_A$, $T_B$, $T_C$:

$$\mathcal{L}_{\text{SR}}(A, B, C) = \mathcal{L}_{\text{SR}}(A, T_A) + \mathcal{L}_{\text{SR}}(B, T_B) + \mathcal{L}_{\text{SR}}(C, T_C). \tag{10}$$
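Since the transformation is known, this term is a plain supervised regression. A minimal sketch follows, in which `f(src, tgt, pts)` is a hypothetical interface for the learned deformation and the two toy predictors only serve to show the extreme cases.

```python
import numpy as np

def self_reconstruction_loss(A, T, f):
    """Average distance between T(p) and the predicted image of p, for p in A."""
    target = T(A)                  # ground-truth deformation of A
    predicted = f(A, target, A)    # network output when asked to deform A into T(A)
    return np.linalg.norm(predicted - target, axis=1).mean()

rng = np.random.default_rng(0)
A = rng.normal(size=(32, 3))
T = lambda P: P * np.array([1.0, 2.0, 0.5])   # toy anisotropic scaling

perfect = lambda src, tgt, pts: tgt           # stand-in for a perfect prediction
identity = lambda src, tgt, pts: pts          # stand-in for an untrained network
loss_perfect = self_reconstruction_loss(A, T, perfect)
loss_identity = self_reconstruction_loss(A, T, identity)
```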


4.3 Application to segmentation

Learning a deformation between two shapes provides an intuitive method to transfer label information, such as a part segmentation, from a labeled shape to an unlabeled one. In this formulation, we assume we are given a (small) number of labeled shapes, and seek to label each point on an unlabeled test shape. This requires us to decide which of the labeled shapes we should use as the source to propagate labels to the target shapes.

Selection Criteria. Given a target shape B, we manually define four possible criteria for selecting a source shape A:

  • Nearest Neighbor: The source shape that minimizes the Chamfer distance between A and B is selected.

  • Deformation Distance: The source shape that minimizes the Chamfer distance between the deformed source and B is selected.

  • Cosine Distance: The source shape that minimizes the cosine distance between the PointNet encodings of A and B is selected.

  • Cycle Consistency: The source shape that minimizes the 2-cycle loss for the pair (A, B) is selected.

Having selected a pair (A, B), labels can be transferred directly with our approach.
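All four criteria share the same structure: score every labeled candidate against the target and keep the lowest score. The sketch below illustrates this with the cosine distance on toy two-dimensional vectors that stand in for PointNet encodings; the function names are hypothetical.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity between two encoding vectors."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def select_sources(criterion, candidates, target, k=1):
    """Indices of the k candidates with the lowest criterion(candidate, target)."""
    scores = np.array([criterion(c, target) for c in candidates])
    return np.argsort(scores)[:k]

# Toy encodings standing in for the PointNet features of three labeled shapes.
encodings = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
target_encoding = np.array([1.0, 0.1])
best_two = select_sources(cosine_distance, encodings, target_encoding, k=2)
```

Swapping `cosine_distance` for a Chamfer distance on raw or deformed points, or for a 2-cycle loss, gives the other criteria without changing the selection logic.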

Voting strategy. Instead of selecting a single source shape to get labels from, combining several source shapes allows for better segmentation. We select the K best sources and let each source shape vote with equal weight for the label of each target point. We evaluate the benefits of this voting approach in Section 5.2.3.
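Label transfer and voting can be sketched as follows, assuming labels are copied from the nearest deformed source point and that ties between parts are broken in favor of the lowest part index (the text does not specify tie-breaking). Function names are hypothetical.

```python
import numpy as np

def transfer_labels(deformed_source, source_labels, target):
    """Give each target point the label of its nearest deformed source point."""
    d = np.linalg.norm(target[:, None, :] - deformed_source[None, :, :], axis=-1)
    return source_labels[d.argmin(axis=1)]

def vote(label_sets, num_parts):
    """Equal-weight majority vote; label_sets has one row of target labels per source."""
    label_sets = np.asarray(label_sets)
    counts = np.stack([(label_sets == part).sum(axis=0) for part in range(num_parts)])
    return counts.argmax(axis=0)   # ties go to the lowest part index

deformed = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
labels = np.array([0, 1])
target = np.array([[0.1, 0.0, 0.0], [0.9, 0.0, 0.0]])
single = transfer_labels(deformed, labels, target)

# Three sources voting on three target points.
voted = vote([[0, 1, 1], [0, 1, 0], [1, 1, 0]], num_parts=2)
```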

(a) Input shape
(b) Retrieved shape
(c) Deformed retrieved shape
(d) Transferred labels to input
(e) Ground truth
(f) Identity baseline
Figure 3: Qualitative results. For each input shape (a), we select the top nearest neighbor from 400 training examples with part segmentations using the cycle-consistency criterion (b). We apply our approach to deform the retrieved shape to align with the input shape (c). Given the deformed shape, we transfer the labels onto the input shape (d). For each category, we show the top results that maximize IoU with the ground truth (e). For comparison, we show the Identity baseline in (f). Notice how our method successfully transfers labels and improves over the baseline.
(a) Input shape
(b) Retrieved shape
(c) Deformed retrieved shape
(d) Transferred labels to input
(e) Ground truth
(f) Identity baseline
Figure 4: Failures. Example failures include when a retrieved shape has inconsistent annotation (rows 1,2,5) and poor deformation due to different topology (rows 3,4).

5 Results

In this section, we show qualitative and quantitative results on the tasks of few-shot and supervised semantic segmentation and compare against several baselines.

Data and evaluation criteria. We evaluated our approach on the standard ShapeNet part dataset [YKC16]. We restricted ourselves to the 5 most populated categories, namely Airplane, Car, Chair, Lamp, and Table. Point clouds sampled on mesh objects are densely labeled for segmentation with one to five parts. We follow Qi et al. [QSMG16] and report the mean intersection over union (mIoU) between the predicted and ground truth segmentation across instances in a category.
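For reference, the metric can be sketched as below. The sketch assumes the common convention for this benchmark in which the IoU of each shape is averaged over the parts of its category, a part absent from both prediction and ground truth counts as an IoU of 1, and the result is averaged over instances; the exact convention used for the reported numbers is not spelled out in the text.

```python
import numpy as np

def shape_iou(pred, gt, num_parts):
    """Mean IoU over the parts of one shape (empty parts count as IoU 1)."""
    ious = []
    for part in range(num_parts):
        inter = np.sum((pred == part) & (gt == part))
        union = np.sum((pred == part) | (gt == part))
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

def category_miou(preds, gts, num_parts):
    """Average of the per-shape IoU over all instances of a category."""
    return float(np.mean([shape_iou(p, g, num_parts) for p, g in zip(preds, gts)]))

pred = np.array([0, 0, 1, 1])
gt = np.array([0, 1, 1, 1])
iou = shape_iou(pred, gt, num_parts=2)   # part 0: 1/2, part 1: 2/3
```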

Baselines. We compare our unsupervised approach against supervised and unsupervised approaches. We used PointNet as a supervised baseline. Our unsupervised baselines include a learned approach derived from AtlasNet [GFK18b] and variants of iterative closest point (ICP) [Zha94, BM92]. AtlasNet is a template-based reconstruction method that predicts a transformation of the template matching the target shape. The learned deformations have been previously observed to be semantically consistent [GFK18a]. To transfer segmentation labels from a source to a target, we project the source labels onto the source reconstruction through nearest neighbors, then onto the template through dense correspondence between the template and the source reconstruction. Similarly, we transfer labels from the template to the target by dense correspondence and nearest neighbors. AtlasNet is trained on the same train/test splits as our approach. We consider two settings of AtlasNet: with 10 patches or with 1 sphere as the template. Additionally, we use two standard shape alignment baselines. First, labels can be transferred from source to target through nearest neighbor matching, which we call the Identity baseline. An immediate refinement over this baseline is to apply ICP to align the source to the target, and then use nearest neighbors. We call the latter the ICP baseline.

5.1 Qualitative Results

Figure 5: Mapping function quality. We apply a checkerboard colorization scheme on the source (left), and use our approach to deform (middle) the source shape to the target shape (right). The labels are transferred from the deformed shape to the target shape through nearest neighbors. For each category, we show an example of good reconstruction (top) and poor reconstruction (bottom). Notice the high quality of the mapping in both cases.

Figure 6: Cycle-consistency performance. We apply a checkerboard colorization scheme on the source (left), and use our approach with cycle-consistency (top) and without (bottom) to deform (middle) the source shape to the target shape (right). The labels are transferred from the deformed shape to the target shape through nearest neighbors.

Correspondences. In Figure 5 we visualize in more detail the correspondences obtained with our approach. We visualize how each point on the source shape is deformed and transferred to the target shape using a colored checkerboard. For each category, we show a successful deformation (top) and a failure case (bottom). Note how the checkerboard appears nicely deformed in the case of successful deformation, and still appears consistent on some parts in the failure cases.

Cycle-consistency. In Figure 6 we compare the mappings learned by our approach with and without the cycle-consistency loss. The Chamfer distance is a point-based loss with no control over the amount of distortion. Notice that, without cycle-consistency, the deformed source has large triangles. This indicates that the mapping learned with a Chamfer loss alone is not smooth and cannot be used for label transfer. On the other hand, the cycle-consistency loss leads to a smooth and high-quality mapping.

Segmentation transfer. When looking at the results, a first surprising observation is the high quality of the identity baseline (this is quantitatively confirmed in Table 2). Indeed, the different criteria tend to select shapes that are really close to the target. To focus on interesting examples, we selected in Figure 3 the pairs that maximize the performance improvement provided by our method compared to the identity baseline using the cycle-consistency-selection criterion. The richness of the learned deformations allows our method to find meaningful correspondences in cases where the training example is far from the target shape and the identity baseline does not work. Note that the deformations are often far from isometric. Thus, methods that rely on regularization toward identity, a popular approach to regularize learned deformations [GFK18a, KTEM18, WZL18], would likely fail.

Failure cases. Figure 4 shows failures of our method. We show for each category the pair which minimizes our segmentation transfer performance. It is clear that the corresponding shapes are rare and specific object instances. We observe two main sources of errors. First, in some cases where we correctly deform the source A into the target B, the ground truth labeling was inconsistent, leading to large errors. For example, notice how the source airplane has a single label. Second, A and B are sometimes too distant topologically, so that a high-fidelity reconstruction of B is impossible by deforming A. For example, notice how the pole of the lamp has been erroneously inflated to match the target shape.

5.2 Quantitative Results

5.2.1 Few-shot Segmentation

In this section, we evaluate our approach on the task of transferring semantic labels from a small set of segmented shapes to unlabeled data.

10 shots             Selection Criterion   Airplane  Car  Chair  Lamp  Table
(a) PointNet         -
(b) AtlasNet Patch   Nearest Neighbors
(c) AtlasNet Sphere  Nearest Neighbors
(d) ICP              Nearest Neighbors
(e) Ours             Nearest Neighbors
(f) Ours             Cycle Consistency
(g) Ours             Oracle
Table 1: Few-shot segmentation. We compare (e, f) our approach with (a) PointNet [QSMG16], a supervised method trained per category, (b, c) two unsupervised baselines based on AtlasNet [GFK18b], and (d) ICP. We pre-train all (b, c, e, f) unsupervised approaches on the train splits (without labels). Given a target shape T and 10 segmented train samples, we select T’s nearest neighbor S. In AtlasNet (b, c), labels are propagated through the template. In our approach (e, f, g), labels are propagated from S to T. We report in (g) the best performance of our method over the 10 shots. The mean IoU is reported. Results are averaged over 10 runs.

We report quantitative results for few-shot semantic segmentation on point clouds in Table 1. Note that the learning-based methods are all trained separately for each category. Since the results depend on the sampled shapes used in the training set, we report the average and standard deviation over ten randomly sampled training sets. We use the Nearest Neighbors criterion to pair sources and targets and compare our approach against all baselines (b, c, d, e). Notice that our approach outperforms all baselines on all categories. Interestingly, the AtlasNet baseline is not on par with ICP, hinting at the difficulty of predicting two consistent deformations of the template.

We find that the Cycle Consistency criterion (f) is a stronger selection criterion than Nearest Neighbors and boosts the results simply by selecting a better (Source, Target) pair. We also report an oracle source-shape selection with our approach where the source shape maximising IoU with the target is selected, which corresponds to the scenario where an optimal source shape is selected. Notice the large improvement of the oracle, showing the quality of our deformations and the potential of our method.

Figure 7: Criteria and voting strategies. Study of the number of voting shapes for the transfer of segmentation labels, across 4 criteria (see 4.3: Nearest Neighbors, Deformation Distance, Cosine Distance, and Cycle Consistency) and across 5 ShapeNet categories. Our transformation method (solid lines) almost always improves on the identity baseline (dashed lines). We also report a supervised baseline, PointNet [QSMG16], and the oracle source which maximizes IoU for our method. Notice how the oracle significantly outperforms the PointNet baseline, making the search for a strong selection criterion a good direction. Our models are category-specific and trained without segmentation supervision. The whole train set is searched to optimize each criterion.

5.2.2 Supervised segmentation

Our method is not designed to be competitive when many training samples are available. Indeed, it solves for the deformation against each of the provided segmented shapes, which for large numbers of examples can be computationally expensive compared to feed-forward segmentation predictions like PointNet [QSMG16]. One forward pass through our network deforms a source shape into a target shape in 7 milliseconds (ms), with a 7 ms standard deviation (std). ICP takes 28 ms with a 17 ms std. (We use Open3D [ZPK18] to compute ICP, run on an Intel i7-6900K at 3.2 GHz, and run our method on an NVIDIA TITAN X.) Here, however, we study the performance of our method in this setting, using the segmentations of the many training shapes and making the ten best shapes vote during testing. We report results of our unsupervised method. In addition, we consider adding supervision to our approach by computing Chamfer distances over points with the same segmentation label. The corresponding results are reported in Table 2.

Table 2 shows that, when using all the annotations, nearest neighbors is again a surprisingly good baseline, only slightly below the performance of PointNet. Despite the good performance of the identity baseline, our method outperforms it in all categories and performs on par with PointNet. Note that the encoders of our approach incorporate two PointNet architectures, which makes this result intuitive.

Table 2 also highlights the importance of the selection criterion. Notice the significant boost gained in each category by carefully choosing the selection criterion rather than defaulting to Nearest Neighbors. The performance of the oracle, well above the PointNet baseline, is a further incentive to design selection criteria carefully.

Finally, notice that our model trained without supervision is on par with our supervised one. The boost gained by supervised training is marginal except in the Car category. This confirms that our cycle-consistent loss is effective at enforcing meaningful part correspondences.

| Method | Selection | Airplane | Car | Chair | Lamp | Table |
|---|---|---|---|---|---|---|
| (a) PointNet | - | 83.4 | 74.9 | 89.6 | 80.8 | 80.6 |
| (b) Identity | NN | 81.3 | 74.0 | 86.1 | 78.4 | 78.9 |
| (c) Ours unsup | NN | 81.5 | 73.9 | 86.6 | 78.8 | 79.2 |
| (d) Ours unsup | Best criterion | 83.4 | 74.6 | 88.4 | 79.8 | 79.7 |
| (e) Ours unsup | Oracle | 87.9 | 78.9 | 93.0 | 93.9 | 89.3 |
| (f) Ours sup | NN | 81.2 | 75.9 | 86.9 | 78.4 | 79.0 |
| (g) Ours sup | Best criterion | 83.5 | 76.4 | 88.8 | 79.3 | 79.9 |
| (h) Ours sup | Oracle | 88.0 | 80.2 | 93.1 | 93.4 | 89.4 |
Table 2: Supervised segmentation. We compare our approach with (a) PointNet [QSMG16] and (b) the identity baseline. Our approach can be trained with part supervision (f, g, h) or without (c, d, e). Given a target shape and the segmented training set, we compare three types of source shapes: (b, c, f) the target's nearest neighbors; (d, g) the best shape among all criteria (see Section 4.3); and (e, h) the a posteriori best shape over all training samples. A voting strategy is used on the top 10 shapes in (b, c, d, f, g). The mean IoU is reported.
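For readers unfamiliar with the metric and the oracle rows, the following sketch shows one way a per-shape part IoU and an oracle source choice could be computed. It assumes the common ShapeNet-part convention in which a part absent from both prediction and ground truth counts as IoU 1; the exact evaluation protocol is not stated in this section, and the function names are hypothetical.

```python
import numpy as np

def shape_iou(pred, gt, part_ids):
    """Average IoU over the parts of one shape, from per-point labels."""
    ious = []
    for part in part_ids:
        inter = np.sum((pred == part) & (gt == part))
        union = np.sum((pred == part) | (gt == part))
        # Assumed convention: a part missing from both labellings scores 1.
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

def mean_iou(preds, gts, part_ids):
    """Mean of the per-shape IoUs over a test set."""
    return float(np.mean([shape_iou(p, g, part_ids) for p, g in zip(preds, gts)]))

def oracle_source(candidate_preds, gt, part_ids):
    """Index of the source whose transferred labels maximize IoU (needs ground truth)."""
    scores = [shape_iou(pred, gt, part_ids) for pred in candidate_preds]
    return int(np.argmax(scores))

gt = np.array([0, 1, 1, 1])
candidates = [np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])]
best = oracle_source(candidates, gt, part_ids=[0, 1])
```

Because the oracle looks at the target's ground-truth labels, it is an upper bound on what any selection criterion could reach, not a usable criterion.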

5.2.3 Selection criteria and voting strategy

Figure 7 shows a quantitative comparison of all criteria, on all categories, for the identity baseline and our approach, using a voting strategy with different numbers of shapes. The oracle and PointNet performance are also reported. The Deformation Distance criterion outperforms all other criteria but remains far from the oracle. The oracle performs better than the PointNet baseline across all categories. As a sanity check, we observe that our method outperforms the identity baseline in all settings, showing that it helps to apply our method to transfer labels from the source shape to the target shape.

Figure 7 also confirms that using several source shapes is beneficial when many annotated examples are available. In the limit, when all source shapes vote and the selection criterion no longer matters, an average labelling is predicted with poor performance, which again underlines the importance of source selection. Using nine source shapes performs best across most criteria and categories when all the training annotations can be used.
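The voting strategy discussed above can be sketched as follows. The sketch assumes that each target point takes the label of its nearest point on the deformed source, and that the K selected sources are combined by an unweighted per-point majority vote; both choices, and the function names, are assumptions made for illustration.

```python
import numpy as np

def transfer_labels(deformed_source, source_labels, target):
    """Give each target point the label of its nearest deformed source point."""
    d2 = ((target[:, None, :] - deformed_source[None, :, :]) ** 2).sum(axis=-1)
    return source_labels[d2.argmin(axis=1)]

def vote(label_sets, num_parts):
    """Per-point majority vote over K transferred labellings of shape (K, N)."""
    label_sets = np.asarray(label_sets)
    # counts[part, i] = number of sources voting `part` for target point i.
    counts = np.stack([(label_sets == part).sum(axis=0) for part in range(num_parts)])
    return counts.argmax(axis=0)  # ties go to the lowest part id

target = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
source = np.array([[0.1, 0.0, 0.0], [0.9, 0.0, 0.0]])
one_vote = transfer_labels(source, np.array([0, 1]), target)
final = vote([one_vote, np.array([0, 0]), np.array([1, 1])], num_parts=2)
```

With every training shape voting, the counts no longer depend on the target, which matches the observation that the prediction degrades to an average labelling.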

5.3 Ablation Study

In this section we conduct an ablation study to empirically validate our approach. Table 3 shows performances without the cycle loss, without Chamfer loss, and without any specific triplet sampling strategy during training, simply selecting random shapes.

Table 3 shows that the cycle consistency loss is critical to the success of our method: removing it lowers the mean IoU from 68.19 to 52.78 with nearest-neighbor selection and from 75.87 to 59.63 with the oracle. Training without Chamfer distance as a reconstruction loss performs slightly better than the identity baseline and below our approach. This highlights the fact that the cycle consistency loss also acts as a reconstruction loss. Finally, our triplet sampling strategy during training provides a small boost.

| Car / 100 shots | Nearest Neighbor | Oracle |
|---|---|---|
| (a) Identity | 67.60 | 73.59 |
| (b) Ours | 68.19 | 75.87 |
| (c) Ours w/o cycle loss | 52.78 | 59.63 |
| (d) Ours w/o chamfer | 66.21 | 74.31 |
| (e) Ours w/o knn restriction | 67.70 | 75.23 |
Table 3: Ablation study. Given a target shape and 100 segmented training samples, we select the target's nearest neighbors (1st column) and the oracle source shape, which maximizes performance for our approach (2nd column). We compare (a) the identity baseline with (b) our approach, trained without label supervision, and (c, d, e) its ablations. The mean IoU is reported. Results are computed on the Car category.
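As a reading aid, the relative drops implied by Table 3 can be recomputed directly from its entries. The snippet below only performs this arithmetic on the reported numbers; the dictionary layout and function name are illustrative.

```python
# Mean IoU from Table 3 as (nearest neighbor, oracle) for the Car category.
table3 = {
    "ours": (68.19, 75.87),
    "w/o cycle loss": (52.78, 59.63),
    "w/o chamfer": (66.21, 74.31),
    "w/o knn restriction": (67.70, 75.23),
}

def relative_drop(full, ablated):
    """Relative decrease of `ablated` with respect to `full`, in percent."""
    return 100.0 * (full - ablated) / full

full_nn, full_oracle = table3["ours"]
drops = {
    name: (relative_drop(full_nn, nn), relative_drop(full_oracle, oracle))
    for name, (nn, oracle) in table3.items()
    if name != "ours"
}

for name, (nn_drop, oracle_drop) in drops.items():
    print(f"{name}: {nn_drop:.1f}% (NN), {oracle_drop:.1f}% (oracle)")
```

Removing the cycle loss costs roughly 22.6% (nearest neighbor) and 21.4% (oracle) in relative terms, whereas the other two ablations each cost about 3% or less.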

5.4 Hyperparameter Study

Figure 8: Hyperparameter study. Study of the influence of the cycle consistency loss, from not having it (abscissa "0") to having only the cycle loss (abscissa "inf"). For each target shape, we use the Nearest Neighbors criterion (see Section 4.3) to select sources from the full training set. A voting strategy is used on the top 10 source shapes. The mean IoU is reported.

Figure 8 demonstrates once more that the cycle-consistency loss is the pivotal insight of our method. It also shows that the results are stable across different weightings of our losses. Note how performance is maintained even in the extreme case with only the cycle-consistency loss. Indeed, the identity function is not a trivial minimum of the cycle consistency loss because of the projection step.
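The last remark can be checked on a toy example. The sketch below assumes a cycle of the form A to B to C and back to A, with a nearest-point projection onto B and onto C after each of the first two deformations and a squared-distance penalty on the returned points; this particular form, and all names, are assumptions for illustration rather than the exact loss of the method.

```python
import numpy as np

def project(points, target):
    """Snap each point to its nearest neighbour in `target` (projection step)."""
    d2 = ((points[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)
    return target[d2.argmin(axis=1)]

def cycle_loss(a, b, c, f_ab, f_bc, f_ca):
    """Mean squared distance between points of A and their image after the cycle."""
    on_b = project(f_ab(a), b)      # deform A towards B, then project onto B
    on_c = project(f_bc(on_b), c)   # deform towards C, then project onto C
    back = f_ca(on_c)               # deform back towards A
    return float(((back - a) ** 2).sum(axis=-1).mean())

def identity(x):
    return x

# Three copies of the same two-point shape, shifted along the y axis.
shift = np.array([0.0, 0.2, 0.0])
a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = a + shift
c = b + shift

loss_identity = cycle_loss(a, b, c, identity, identity, identity)
loss_shift = cycle_loss(a, b, c, lambda x: x + shift, lambda x: x + shift,
                        lambda x: x - 2 * shift)
```

With identity deformations the projections move each point onto B and then onto C, so the cycle ends 0.4 away from where it started and the loss is 0.16; the deformations that actually align the shapes bring the loss to zero.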

6 Conclusion

We have presented a method for learning a parametric transformation between two surfaces that leverages cycle-consistency as a supervisory signal to predict meaningful correspondences. Our method does not require an object template, can operate without any inter-shape correspondence supervision, and does not assume the deformation is nearly isometric. We demonstrate that our method transfers segmentation labels from a very small number of labeled examples significantly better than state-of-the-art methods, and matches their segmentation performance when a larger training dataset is provided.

We believe that the large gap between our performance and that of the “oracle shape”, which provides maximal accuracy, points to several promising research directions: using learned deformations to transfer labels, better understanding which source models should be selected, and finding new ways to aggregate information across multiple sources.

References


  • [ACBCO17] Azencot O., Corman E., Ben-Chen M., Ovsjanikov M.: Consistent functional cross field design for mesh quadrangulation. ACM Trans. Graph. 36, 4 (July 2017), 92:1–92:13.
  • [BBK06] Bronstein A. M., Bronstein M. M., Kimmel R.: Generalized multidimensional scaling: A framework for isometry-invariant partial surface matching. Proceedings of the National Academy of Sciences 103, 5 (2006), 1168–1172.
  • [BM92] Besl P. J., McKay N. D.: A method for registration of 3-d shapes. IEEE Trans. Pattern Anal. Mach. Intell. 14, 2 (Feb. 1992), 239–256.
  • [BMRB16] Boscaini D., Masci J., Rodolà E., Bronstein M. M.: Learning shape correspondence with anisotropic convolutional neural networks. CoRR abs/1605.06437 (2016).
  • [BR07] Brown B., Rusinkiewicz S.: Global non-rigid alignment of 3-D scans. ACM Transactions on Graphics (Proc. SIGGRAPH) 26, 3 (Aug. 2007).
  • [CFG15] Chang A. X., Funkhouser T. A., Guibas L. J., Hanrahan P., Huang Q., Li Z., Savarese S., Savva M., Song S., Su H., Xiao J., Yi L., Yu F.: Shapenet: An information-rich 3d model repository. CoRR abs/1512.03012 (2015).
  • [CK15] Chen Q., Koltun V.: Robust nonrigid registration by convex optimization. ICCV (2015).
  • [EBC17] Ezuz D., Ben-Chen M.: Deblurring and denoising of maps between shapes. Comput. Graph. Forum 36, 5 (Aug. 2017), 165–174.
  • [GFK18a] Groueix T., Fisher M., Kim V. G., Russell B., Aubry M.: 3d-coded : 3d correspondences by deep deformation. In ECCV (2018).
  • [GFK18b] Groueix T., Fisher M., Kim V. G., Russell B., Aubry M.: AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2018).
  • [HAWG08] Huang Q., Adams B., Wicke M., Guibas L. J.: Non-rigid registration under isometric deformations. In Computer Graphics Forum (2008), vol. 27, pp. 1449–1457.
  • [HG13] Huang Q.-X., Guibas L.: Consistent shape maps via semidefinite programming. In Proceedings of the Eleventh Eurographics/ACM SIGGRAPH Symposium on Geometry Processing (Aire-la-Ville, Switzerland, 2013), SGP ’13, Eurographics Association, pp. 177–186.
  • [HKC18] Huang H., Kalogerakis E., Chaudhuri S., Ceylan D., Kim V. G., Yumer E.: Learning local shape descriptors from part correspondences with multi-view convolutional networks. Transactions on Graphics (2018).
  • [HZG12] Huang Q.-X., Zhang G.-X., Gao L., Hu S.-M., Butscher A., Guibas L.: An optimization approach for extracting and encoding consistent maps in a shape collection. ACM Trans. Graph. 31, 6 (Nov. 2012), 167:1–167:11.
  • [JSZ15] Jaderberg M., Simonyan K., Zisserman A., et al.: Spatial transformer networks. In Advances in neural information processing systems (2015), pp. 2017–2025.
  • [KAMC17] Kalogerakis E., Averkiou M., Maji S., Chaudhuri S.: 3D shape segmentation with projective convolutional networks. In Proc. IEEE Computer Vision and Pattern Recognition (CVPR) (2017).
  • [KB14] Kingma D. P., Ba J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • [KHS10] Kalogerakis E., Hertzmann A., Singh K.: Learning 3D Mesh Segmentation and Labeling. ACM Transactions on Graphics 29, 3 (2010).
  • [KLF11] Kim V. G., Lipman Y., Funkhouser T.: Blended intrinsic maps. Transactions on Graphics (Proc. of SIGGRAPH), 4 (2011).
  • [KLM12] Kim V. G., Li W., Mitra N. J., DiVerdi S., Funkhouser T.: Exploring Collections of 3D Models using Fuzzy Correspondences. Transactions on Graphics (Proc. of SIGGRAPH), 4 (2012).
  • [KTEM18] Kanazawa A., Tulsiani S., Efros A. A., Malik J.: Learning category-specific mesh reconstruction from image collections.
  • [LRR17] Litany O., Remez T., Rodolà E., Bronstein A. M., Bronstein M. M.: Deep functional maps: Structured prediction for dense shape correspondence. CoRR abs/1704.08686 (2017).
  • [LSD18] Li L., Sung M., Dubrovina A., Yi L., Guibas L. J.: Supervised fitting of geometric primitives to 3d point clouds. CVPR (2018).
  • [LSP08] Li H., Sumner R. W., Pauly M.: Global correspondence optimization for non-rigid registration of depth scans. Computer Graphics Forum (Proc. SGP’08) 27, 5 (July 2008).
  • [MWZ14] Mitra N. J., Wand M., Zhang H., Cohen-Or D., Kim V. G., Huang Q.-X.: Structure-Aware Shape Processing. SIGGRAPH Course notes (2014).
  • [MZC19] Mo K., Zhu S., Chang A., Yi L., Tripathi S., Guibas L., Su H.: PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding.
  • [NBCW11] Nguyen A., Ben-Chen M., Welnicka K., Ye Y., Guibas L.: An optimization approach to improving collections of shape maps. In Computer Graphics Forum (2011), vol. 30, Wiley Online Library, pp. 1481–1491.
  • [OBCS12] Ovsjanikov M., Ben-Chen M., Solomon J., Butscher A., Guibas L.: Functional maps: A flexible representation of maps between shapes. ACM Trans. Graph. 31, 4 (July 2012), 30:1–30:11.
  • [OMMG10] Ovsjanikov M., Mérigot Q., Mémoli F., Guibas L. J.: One point isometric matching with the heat kernel. Comput. Graph. Forum 29, 5 (2010), 1555–1564.
  • [QSMG16] Qi C. R., Su H., Mo K., Guibas L. J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. arXiv preprint arXiv:1612.00593 (2016).
  • [QYSG17] Qi C. R., Yi L., Su H., Guibas L. J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413 (2017).
  • [RL01] Rusinkiewicz S., Levoy M.: Efficient variants of the icp algorithm. In Proceedings Third International Conference on 3-D Digital Imaging and Modeling (2001).
  • [ROA13] Rustamov R. M., Ovsjanikov M., Azencot O., Ben-Chen M., Chazal F., Guibas L.: Map-based exploration of intrinsic shape differences and variability. ACM Trans. Graph. 32, 4 (July 2013), 72:1–72:12.
  • [RPWO18] Ren J., Poulenard A., Wonka P., Ovsjanikov M.: Continuous and orientation-preserving correspondences via functional maps. ACM Trans. Graph. 37, 6 (Dec. 2018), 248:1–248:16.
  • [vKZHCO11] van Kaick O., Zhang H., Hamarneh G., Cohen-Or D.: A survey on shape correspondence. Computer Graphics Forum 30, 6 (2011), 1681–1707.
  • [WHC16] Wei L., Huang Q., Ceylan D., Vouga E., Li H.: Dense human body correspondences using convolutional networks. In Computer Vision and Pattern Recognition (CVPR) (2016).
  • [WSL18] Wang Y., Sun Y., Liu Z., Sarma S. E., Bronstein M. M., Solomon J. M.: Dynamic graph CNN for learning on point clouds. CoRR abs/1801.07829 (2018).
  • [WZL18] Wang N., Zhang Y., Li Z., Fu Y., Liu W., Jiang Y.-G.: Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV (2018).
  • [WZS19] Wang X., Zhou B., Shi Y., Chen X., Zhao Q., Xu K.: Shape2motion: Joint analysis of motion parts and attributes from 3d shapes. In CVPR (2019), p. to appear.
  • [YKC16] Yi L., Kim V. G., Ceylan D., Shen I.-C., Yan M., Su H., Lu C., Huang Q., Sheffer A., Guibas L.: A scalable active framework for region annotation in 3d shape collections. SIGGRAPH Asia (2016).
  • [YLZ19] Yu F., Liu K., Zhang Y., Zhu C., Xu K.: Partnet: A recursive part decomposition network for fine-grained and hierarchical shape segmentation. In CVPR (2019), p. to appear.
  • [Zha94] Zhang Z.: Iterative point matching for registration of free-form curves and surfaces, 1994.
  • [ZKA16] Zhou T., Krähenbühl P., Aubry M., Huang Q., Efros A. A.: Learning dense correspondence via 3d-guided cycle consistency. In Computer Vision and Pattern Recognition (CVPR) (2016).
  • [ZPIE17] Zhu J.-Y., Park T., Isola P., Efros A. A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on (2017).
  • [ZPK18] Zhou Q.-Y., Park J., Koltun V.: Open3D: A modern library for 3D data processing. arXiv:1801.09847 (2018).
  • [ZSCO08] Zhang H., Sheffer A., Cohen-Or D., Zhou Q., van Kaick O., Tagliasacchi A.: Deformation-driven shape correspondence. Computer Graphics Forum (Special Issue of Symposium on Geometry Processing) 27, 5 (2008), 1431–1439.