BAE-NET: Branched Autoencoder for Shape Co-Segmentation

03/27/2019 ∙ by Zhiqin Chen, et al.

We treat shape co-segmentation as a representation learning problem and introduce BAE-NET, a branched autoencoder network, for the task. The unsupervised BAE-NET is trained with all shapes in an input collection using a shape reconstruction loss, without ground-truth segmentations. Specifically, the network takes an input shape and encodes it using a convolutional neural network, whereas the decoder concatenates the resulting feature code with a point coordinate and outputs a value indicating whether the point is inside/outside the shape. Importantly, the decoder is branched: each branch learns a compact representation for one commonly recurring part of the shape collection, e.g., airplane wings. By complementing the shape reconstruction loss with a label loss, BAE-NET is easily tuned for one-shot learning. We show unsupervised, weakly supervised, and one-shot learning results by BAE-NET, demonstrating that using only a couple of exemplars, our network can generally outperform state-of-the-art supervised methods trained on hundreds of segmented shapes.




1 Introduction

Co-segmentation takes a collection of data sharing some common characteristic and produces a consistent segmentation of each data item. Specific to shape co-segmentation, the common characteristic of the input collection is typically tied to the common category that the shapes belong to, e.g., they are all lamps or chairs. The significance of the problem is attributed to the consistency requirement, since a shape co-segmentation not only reveals the structure of each shape but also a structural correspondence across the entire shape set, enabling a variety of applications including attribute transfer and mix-n-match modeling.

In recent years, many deep neural networks have been developed for segmentation [3, 21, 25, 31, 41, 63]. Most methods to date formulate segmentation as a supervised classification problem and are applicable to segmenting a single input. Representative approaches include SegNet [3] for images and PointNet [40] for shapes, where the networks are trained by supervision with ground-truth segmentations to map pixel or point features to segment labels.

Co-segmentation seeks a structural understanding or explanation of an entire set. If one were to abide by Occam’s razor, then the best explanation would be the simplest one. This motivates us to treat co-segmentation as a representation learning problem, with the added bonus that such learning may be unsupervised without any ground-truth labels. Given the strong belief that object recognition by humans is part-based [14, 15], the simplest explanation for a collection of shapes belonging to the same category would be a combination of universal parts for that category, e.g., chair backs or airplane wings. Hence, an unsupervised shape co-segmentation would amount to finding the simplest part representations for a shape collection. Our choice for the representation learning module is a variant of autoencoder.

Figure 1: Unsupervised co-segmentation by Bae-Net on the lamp category from the ShapeNet part dataset [62]. Each color denotes a part labeled by a specific branch of our network.

In principle, autoencoders learn compact representations of a set of data via dimensionality reduction while minimizing a self-reconstruction loss. To learn shape parts, we introduce a branched version of autoencoders, where each branch is tasked to learn a simple representation for one universal part of the input shape collection. In the absence of any ground-truth segmentation labels, our branched autoencoder, or Bae-Net for short, is trained to minimize a shape (rather than label) reconstruction loss, where shapes are represented using implicit fields [6]. Specifically, the Bae-Net decoder takes as input a concatenation between a point coordinate and an encoding of the input shape (from the Bae-Net encoder) and outputs a value which indicates whether the point is inside/outside the shape.

Figure 2: Network architecture of the Bae-Net decoder; the encoder is a CNN. L3 is the branch output layer (one neuron per branch) that gives the implicit field for each branch. The final output layer groups the branch outputs, via max pooling, to form the final implicit field. Each branch could represent a shape part, or simply output nothing if all the parts are represented by other branches. The max pooling operator allows part overlap, giving Bae-Net the freedom to represent each part in the most natural or simplest way. All the colors for the parts are for visualization only.

The Bae-Net architecture is shown in Figure 2, where the encoder employs a traditional convolutional neural network (CNN). The encoder-decoder combination of Bae-Net is trained with all shapes in the input collection using a (shape) reconstruction loss. Appending the point coordinate to the decoder input is critical since it adds spatial awareness to the reconstruction process, which is often lost in the convolutional features from the CNN encoder. Each neuron in the third layer (L3) is trained to learn the inside/outside status of the point relative to one shape part. The parts are learned in a joint space of shape features and point locations. In Section 4, we will show that the limited neurons in our autoencoder for representation learning, and the linearities modeled by the neurons, allow Bae-Net to learn compact part representations in the decoder branches in L3. Figure 3 shows a toy example to illustrate the joint space and contrasts the part representations that are learned by our Bae-Net versus a classic CNN autoencoder.

Figure 3: A toy example where the input set comprises 2D images of three randomly placed patterns. In our network, when the feature dimension is set to be 16, learning the features amounts to “sorting” the training images in the joint (2+16)-D space, so that the hyperdimensional shape is simple and easy to represent. As shown on the left, where only one feature dimension is drawn, Bae-Net learns a simple three-part representation exhibiting continuity and linearity in the feature dimension, since we treat image and feature dimensions equally in the network. The CNN model on the right represents the shapes as discrete image blocks without continuity. Section 4.1 shows results on the full image.

Bae-Net has a simple architecture and as such, it can be easily adapted to perform one-shot learning, where only one or few exemplar segmentations are provided. In this case, the number of branches is set according to the exemplar(s) and the shape reconstruction loss is complemented by a label reconstruction loss; see Section 3.1.

We demonstrate unsupervised, weakly supervised, and one-shot learning results for shape co-segmentation on the ShapeNet [5], ShapeNet part [62], and Tags2Parts [34] datasets, comparing Bae-Net with existing supervised and weakly supervised methods. The co-segmentation results from Bae-Net are consistent, without explicitly enforcing any consistency loss in the network. Using only one (resp. two or three) segmented exemplars, the one-shot learning version of Bae-Net outperforms state-of-the-art supervised segmentation methods, including PointNet++ and PointCNN, when the latter are trained on 10% (resp. 20% or 30%), i.e., hundreds, of the segmented shapes.

2 Related work

Our work is most closely related to prior research on unsupervised, weakly supervised, and semi-supervised co-segmentation of 3D shapes. Many of these methods, especially those relying on statistical learning, are inspired by related research on 2D image segmentation.

Image co-segmentation without strong supervision.  One may view unsupervised image co-segmentation as pixel-level clustering, guided by color similarity within single images as well as consistency across image collections. Existing approaches employ graph cuts [43], discriminative clustering [23, 24], correspondences [44, 45], cooperative games [30], and deep neural networks [16, 55, 59, 64].

An alternative research theme utilizes some weak cues, such as image-level labels. In this weakly supervised setup, the goal is to find those image regions that strongly correlate with each label, using either traditional statistical models [7, 51, 54] or newer ones based on deep networks [1, 9, 11, 22, 28, 35, 37, 38, 39, 46, 56, 65]. Other forms of weak supervision include bounding boxes [8, 17, 26, 36, 42, 66] and textual captions [4, 10]. Semi-supervised methods assume that full supervision is available for a few images, while the rest are unsupervised: such approaches have been applied to image co-segmentation, e.g. [32, 58].

In contrast to all the above methods, we develop an unsupervised co-segmentation approach for geometric shapes, without color data. Thus, we concentrate on efficient modeling of spatial variance. We employ a novel encode-and-reconstruct scheme, where each branch of a deep network learns to localize instances of a part across multiple examples in order to compactly represent them, and re-assemble them into the original shape. Our method easily adapts to weakly- and semi-supervised scenarios.

Our method is critically dependent on the choice of network architecture. The relatively shallow fully-connected stack is high-capacity enough to model non-trivial parts, but shallow enough that the neurons are forced to learn a compact, efficient representation of the shape space, in terms of recurring parts carved out by successive simple units. Thus, the geometric prior is inherent in the architecture itself, similar in spirit to deep image priors [53].

3D segmentation without strong supervision.  Building on substantial prior work on single-shape mesh segmentation based on geometric cues [47], the pioneering work of Golovinskiy and Funkhouser [12] explored consistent co-segmentation of 3D shapes by constructing a graph connecting not just adjacent polygonal faces in the same mesh, but also corresponding faces across different meshes. A normalized cut of this graph yields a joint segmentation. Subsequently, several papers developed alternative unsupervised clustering strategies for mesh faces, given a handcrafted similarity metric induced by a feature embedding or a graph [18, 20, 33, 50, 60]. Shu et al. [49] modified this setup by first transforming handcrafted local features with a stacked autoencoder before applying an (independently learned) Gaussian mixture model and per-shape graph cuts. In contrast, our method is an end-to-end differentiable pipeline that requires no manual feature specification or large-scale graph optimization.

Tulsiani et al. [52] proposed an unsupervised method to approximate a 3D shape with a small set of primitive parts (cuboids), inducing a segmentation of the underlying mesh. Their approach has similarities to ours – they predict part cuboids with branches from an encoder network, and impose a reconstruction loss to make the cuboid assembly resemble the original shape. However, the critical difference is the restriction to cuboidal boxes: they cannot accommodate complex, non-convex part geometries such as the rings in Figure 4 and the groups of disjoint lamp parts in Figure 1, for which a nonlinear stack of neurons is a much more effective indicator function.

In parallel, approaches were developed for weakly supervised [34, 48] and semi-supervised [19, 27, 57] shape segmentation. Shapes formed by articulation of a template shape can be jointly co-segmented [2, 61]. Our method does not depend on any such supervision or base template, although it can optionally benefit from one or two annotated examples to separate strongly correlated part pairs.

3 Bae-Net: architecture, loss, and training

The architecture of Bae-Net draws inspiration from the recently introduced implicit decoder of Chen and Zhang [6]. Their network learns an implicit field by means of a binary classifier, which is similar to Bae-Net. The main difference, as shown in Figure 2, is that Bae-Net is designed to segment a shape into different parts by reconstructing the parts in different branches of the network.

Similar to [6], we use a traditional convolutional neural network as the encoder to produce the feature code for a given shape. We also adopt a three-layer fully connected neural network as our decoder network. The network takes a joint vector of point coordinates and feature code as input, and outputs a value in each output branch that indicates whether the input point is inside a part (1) or not (0). Finally, we use a max pooling operator to merge parts together and obtain the entire shape, which allows our segmented parts to overlap. We use “L1”, “L2” and “L3” to represent the first, second, and third layer, respectively. The different network design choices will be discussed in Section 4.1.
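As a rough illustration of the decoder just described, the following NumPy sketch concatenates point coordinates with a feature code, passes them through three fully connected layers, and max-pools the L3 branch outputs into a single field. The function names, the random initialization, the leaky ReLU slope, the use of leaky ReLU in L2, and the sigmoid on L3 are assumptions made for this sketch rather than details stated in the text; the CNN encoder is omitted and the feature code is taken as given.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # Slope value is an assumption of this sketch.
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_decoder(code_dim, widths, rng):
    """Random (weight, bias) pairs for layers L1, L2, L3; widths = (L1, L2, L3)."""
    dims = [3 + code_dim, *widths]  # input = 3D point + feature code
    return [(rng.normal(0.0, 0.02, (d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def decode(params, points, code):
    """points: (N, 3); code: (code_dim,).

    Returns the per-branch fields (N, k) and the max-pooled shape field (N,).
    """
    # Every point is paired with the same shape feature code.
    x = np.concatenate([points, np.tile(code, (len(points), 1))], axis=1)
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = leaky_relu(x @ w1 + b1)
    h2 = leaky_relu(h1 @ w2 + b2)
    branches = sigmoid(h2 @ w3 + b3)  # assumed squashing to (0, 1)
    return branches, branches.max(axis=1)

# Tiny illustrative sizes; the paper's decoders are far wider.
rng = np.random.default_rng(0)
params = init_decoder(code_dim=8, widths=(32, 16, 4), rng=rng)
points = rng.uniform(-0.5, 0.5, (5, 3))
code = rng.normal(size=8)
branches, field = decode(params, points, code)
```

Because the final field is a maximum over branches, a point may have a high value in more than one branch, which is what permits overlapping parts.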


3.1 Network losses for various learning scenarios


For each input point coordinate, our network outputs a value that indicates the likelihood that the given point is inside the shape. We train our network with sampled points in the 3D space surrounding the input shape and the inside-outside status of the sampled points. After sampling points for input shapes using the same method as [6], we can train our autoencoder with a mean square loss:

$$L_{unsup}(\mathbb{S}) = \mathbb{E}_{S \sim \mathbb{S}} \, \mathbb{E}_{p \sim P(p|S)} \left( f(p) - F(p) \right)^2$$

where $\mathbb{S}$ is the distribution of training shapes, $P(p|S)$ is the distribution of sampled points given shape $S$, $f(p)$ is the output value of our decoder for input point $p$, and $F(p)$ is the ground truth inside-outside status for point $p$. This loss function allows us to reconstruct the target shape in the output layer. The segmented parts will be expressed in the branches of L3, since the final output is taken as the maximum value over the fields represented by those branches.


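If we have examples with ground truth part labels, we can also train Bae-Net in a supervised way. Denote $F_m(p)$ as the ground truth status for point $p$, and $f_m(p)$ as the output value of the $m$-th branch in L3. For a network with $k$ branches in L3, the supervised loss is:

$$L_{sup}(\mathbb{S}) = \mathbb{E}_{S \sim \mathbb{S}} \, \mathbb{E}_{p \sim P(p|S)} \frac{1}{k} \sum_{m=1}^{k} \left( f_m(p) - F_m(p) \right)^2$$

where, as in the unsupervised loss, the expectation is taken over training shapes $S \sim \mathbb{S}$ and over the points $p \sim P(p|S)$ sampled for each shape.
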
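In datasets such as the ShapeNet part dataset [62], shapes are represented by point clouds sampled from their surfaces. In such datasets, the inside-outside status of a point can be ambiguous. However, since our sampled points are from voxel grids and the reconstructed shapes are thicker than the original, we can assume all points in the ShapeNet part dataset are inside our reconstructed shapes. We use both our sampled points from voxel grids and the point clouds in the ShapeNet part dataset, by modifying the loss function:

$$L_{sup}(\mathbb{S}) = \mathbb{E}_{S \sim \mathbb{S}} \left[ \mathbb{E}_{p \sim P(p|S)} \left( f(p) - F(p) \right)^2 + \alpha \, \mathbb{E}_{q \sim Q(q|S)} \frac{1}{k} \sum_{m=1}^{k} \left( f_m(q) - F_m(q) \right)^2 \right]$$

where $P(p|S)$ is the distribution for our sampled points from voxel grids, and $Q(q|S)$ is the distribution of points in the ShapeNet part dataset. Here $f$ and $F$ are the final decoder output and the ground truth inside-outside status, while $f_m$ and $F_m$ are the output and ground truth status for the $m$-th of the $k$ branches in L3. We set $\alpha$ to 1 in our experiments.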
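The losses in this section can be sketched in a few lines of NumPy, operating on branch outputs that have already been computed for a batch of points. This is only an illustrative sketch: the function names are hypothetical, and averaging the squared error over the branches is an assumption of this sketch.

```python
import numpy as np

def unsupervised_loss(branches, inside):
    """Shape reconstruction loss.

    branches: (N, k) outputs of the L3 branches for N sampled points.
    inside:   (N,) ground truth inside (1) / outside (0) status.
    """
    field = branches.max(axis=1)  # max pooling over branches
    return float(np.mean((field - inside) ** 2))

def supervised_loss(branches, part_inside):
    """Label loss; part_inside is (N, k), one 0/1 column per part."""
    per_point = np.mean((branches - part_inside) ** 2, axis=1)  # average over branches (assumed)
    return float(np.mean(per_point))

def combined_supervised_loss(vox_branches, vox_inside,
                             pc_branches, pc_part_inside, alpha=1.0):
    """Reconstruction term on voxel-grid samples plus alpha times the
    label term on labeled surface points."""
    return (unsupervised_loss(vox_branches, vox_inside)
            + alpha * supervised_loss(pc_branches, pc_part_inside))
```

Keeping the two terms on separate point sets mirrors the text: voxel-grid samples carry only inside-outside status, while the labeled point cloud carries part labels.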

One-shot learning.

Our network also supports one-shot training, where we feed the network one (or two, three, …) shapes with ground truth part labels, and other shapes without part labels. To enable one-shot training, we have a joint loss:

$$L_{joint} = L_{unsup}(\mathbb{S}) + \beta \, L_{sup}(\mathbb{T})$$

where $\mathbb{S}$ is the distribution of all shapes, and $\mathbb{T}$ is the distribution of the few given shapes with part labels. We do not explicitly use this loss function or set $\beta$. Instead, we train the network using the unsupervised and supervised losses alternately. We do one supervised training iteration after every 4 unsupervised training iterations.
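The alternating schedule can be written as a small generator. The sketch below is a hypothetical helper, not code from the paper: it yields which loss to use at each iteration, with one supervised step after every four unsupervised steps, and an optional supervised warm-up of the kind mentioned in Section 3.3.

```python
def training_schedule(num_iterations, unsup_per_sup=4, sup_warmup=0):
    """Yield "sup" or "unsup" for each training iteration.

    sup_warmup: number of purely supervised iterations run first.
    unsup_per_sup: unsupervised iterations before each supervised one.
    """
    period = unsup_per_sup + 1
    for it in range(num_iterations):
        if it < sup_warmup:
            yield "sup"
        elif (it - sup_warmup) % period == unsup_per_sup:
            yield "sup"   # every fifth step after the warm-up
        else:
            yield "unsup"

# The caller picks the matching loss and batch at each step.
for step, mode in enumerate(training_schedule(10)):
    pass  # e.g. run one optimizer step with the loss named by `mode`
```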

Additionally, we add a very small regularization term for the parameters of L3, so as to prevent unnecessary overlap, e.g., when the part output by one branch contains the part output by another branch.

3.2 Point label assignment

After training, we get an approximate implicit field for the input shape. To label a given point of an input shape, we simply feed the point into the network together with the code encoding the feature of the input shape, and label the point by looking at which branch in L3 gives the highest value. If the training has exemplar shapes as guidance, each branch will be assigned a label automatically with respect to the exemplars. If the training is unsupervised, we need to look at the branch outputs and give a label for each branch by hand. For example in Figure 2, we can label branch #3 as “jet engine”, and each point having the highest value in this branch will be labeled as “jet engine”. To label a mesh model, we first subdivide the surfaces to obtain fine-grained triangles, and assign a label for each triangle. To label a triangle, we feed its three vertices into the network and sum their output values in each branch, and assign the label whose associated branch gives the highest value.
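A minimal sketch of this labeling rule, assuming the branch outputs have already been computed for the query points: each point takes the label of its highest-valued branch, and each triangle takes the label of the branch with the highest summed value over its three vertices. The function names and the example label list are illustrative only.

```python
import numpy as np

def label_points(branches, branch_labels):
    """branches: (N, k) branch outputs; branch_labels: k label names."""
    winners = branches.argmax(axis=1)
    return [branch_labels[i] for i in winners]

def label_triangle(vertex_branches, branch_labels):
    """vertex_branches: (3, k) branch outputs at the triangle's vertices."""
    summed = vertex_branches.sum(axis=0)  # sum per branch over the vertices
    return branch_labels[int(summed.argmax())]

labels = ["body", "wing", "jet engine"]  # hand-assigned in the unsupervised case
point_outputs = np.array([[0.1, 0.7, 0.2],
                          [0.6, 0.3, 0.1]])
point_labels = label_points(point_outputs, labels)
```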

3.3 Training details

In the following, we denote the decoder structure by the width (number of neurons) of each layer as { L1-L2-L3 }. The encoders for all tasks are standard CNN encoders.

In the 2D shape extraction experiment, we set the feature vector to be 16-D, since the goal of this experiment is to explain why and how the network works and the shapes are easy to represent. We used the same width for all hidden layers since it is easier for us to compare models with different depths. For images, the decoder is { 256-4 } for 2-layer model, { 256-256-4 } for 3-layer model, and { 256-256-256-4 } for 4-layer model. For images, we use as the width instead of .

In all other experiments, our encoder takes voxels as input, and produces a -D feature code. We sample points from each shape’s voxel model, and use these point-value pairs to compute the unsupervised loss .

For unsupervised tasks, we set the decoder network to { 3072-384-12 } and train for 200,000 iterations with mini-batches of size 1, which takes 2 hours per category. For one-shot tasks, we use a { 1024-256-$k$ } decoder, where $k$ is the number of ground truth parts in the exemplar shapes. The decoder is lighter, hence we finish training in a much shorter time. For each category, we train for 100,000 iterations: on all 15 categories this takes less than 10 hours total on a single NVIDIA GTX 1080Ti GPU. We also find that doing 3,000-4,000 iterations of supervised training before alternating it with unsupervised training improves the results. We plan to release all code for this project.

4 Experiments and results

In this section, we show qualitative and quantitative segmentation results in various settings with our Bae-Net and compare them to those from state-of-the-art methods. But first, in Section 4.1, we discuss different architecture design choices and offer insights into how our network works.

4.1 Network design choices and insights

Figure 4: Independent shape extraction results of different models. The first three rows show the segmentation results of the 3-layer model. The next three rows show the segmentation results of other models for comparison. The last row shows the extrapolation results continuing its previous row. Note that no shape patterns go beyond the boundary in our synthesized training dataset, thus we can be certain that some shapes in the last row are completely new.

Figure 5: Visualization of neurons in the first, second and third layer of our 3-layer network. Since L1 and L2 have hundreds of neurons, we only select a few representative ones to show here. More visualizations can be found in the supplementary material.

We first explain our network design choices in detail, as illuminated by two synthetic 2D datasets, “elements” and “triple rings”. “Elements” is synthesized by putting three different shape patterns over images, where the cross is placed randomly on the image, the triangle is placed randomly on a horizontal line in the middle of the image, and the rhombus is placed randomly on a vertical line in the middle of the image. “Triple rings” is synthesized by placing three rings of different sizes randomly over images. See Figure 4 for some examples.

First, we train Bae-Net with 4 branches on the two datasets; see some results in Figure 4. Our network successfully separated the shape patterns, even when two patterns overlap. Further, each of the output branches only outputs one specific shape pattern, thus we also obtain a shape correspondence from the co-segmentation process.

We visualize the neuron activations in Figure 5. In L1, the point coordinates and the shape feature code have gone through a linear transform and a leaky ReLU activation, therefore the activation maps in L1 are linear “space dividers” with gradients. In L2, each neuron linearly combines the fields in L1 to form basic shapes. The combined shapes are mostly convex: although non-convex shapes can be formed, they will need more resources (L1 neurons) than simpler shapes. This is because L2 neurons calculate a weighted sum of the values in L1 neurons, not MIN, MAX, or a logical operation, thus each L1 neuron brings a global, rather than local, change in L2. L2 represents higher level shapes than L1, therefore we can expect L1 to have many more neurons than L2, and we incorporate this idea in our network design for shape segmentation, to significantly shorten training time. The L3 neurons further combine the shapes in L2 to form output parts in our network, and our final output combines all L3 outputs via max pooling.

After understanding how the network works, we will be able to explain why our network tends to output segmented, corresponding parts in each branch. For a single shape, the network has limited representation ability in L3, therefore it prefers to construct simple parts in each branch, and let our max pooling layer combine them together. This allows better reconstruction quality than reconstructing the whole shape in just one branch. With an appropriate optimizer to minimize the reconstruction error, we can get well-segmented parts in the output branches.

For part correspondence, we need to also consider the input shape feature codes. As shown in Figure 2, our network treats the feature code and point coordinates equally. This allows us to consider the whole decoder as a hyperdimensional implicit field, in a joint space made by both image dimensions (input point coordinates) and shape feature dimensions (shape feature code). In Figure 3, we visualize a 3D slice of this implicit field, with two image space dimensions and one feature dimension. Our network is trying to find the best way to represent all shapes in the training set, and the easiest way is to arrange the training shapes so that the hyperdimensional shape is continuous in the feature dimension, as shown in the figure. This encourages the network to learn to represent the final complex hyperdimensional shape as a composition of a few simple hyperdimensional shapes. In Figure 4, we show how our trained network can accomplish smooth interpolation and extrapolation of shapes. Our network is able to simultaneously accomplish segmentation and correspondence.

We compare the segmentation results of our current 3-layer model with those of a 2-layer model, a 4-layer model and a CNN model in Figure 4. (Detailed network parameters are in the supplementary material.) The 2-layer model has a hard time reconstructing the rings, since L2 is better at representing convex shapes. The 4-layer model can separate parts, but since most shapes can already be represented in L3, the extra fourth layer does not necessarily output separated parts. One can easily construct an L4 layer on top of our 3-layer model to output the whole shape in one branch and leave the other branches blank. The CNN model is not sensitive to parts and outputs basically everything or nothing in each branch, because there is no bias towards sparsity or segmentation. In our experiments, the network with 3 layers was the best choice for independent shape extraction, which makes it a natural candidate for unsupervised and weakly supervised shape segmentation.

4.2 Evaluation of unsupervised learning

Figure 6: Unsupervised segmentation results by Bae-Net. The first three rows show segmentation results on bench, couch, car, rifle, chair and table respectively. In the last row, we show the results when merging chair and table into a joint dataset and training on it. Since our model generates a field for each part, we render the original meshes with different colors representing different parts.

| Shape (#parts) | airplane (3) | bag (2) | cap (2) | chair (3) | chair* (4) | mug (2) | skateboard (2) | table (2) |
|---|---|---|---|---|---|---|---|---|
| Segmented parts | body, tail, wing+engine | body, handle | panel, peak | back+seat, leg, arm | back, seat, leg, arm | body, handle | deck, wheel+bar | top, leg+support |
| IOU | 61.1 | 82.5 | 87.3 | 65.5 | 83.7 | 93.4 | 63.5 | 78.7 |
| mod-IOU | 80.4 | 82.5 | 87.3 | 86.6 | 83.7 | 93.4 | 88.1 | 87.0 |

Table 1: Quantitative results by Bae-Net on the ShapeNet part dataset [62] by IOU measured against ground-truth parts. Chair* denotes the chair category trained on the joint chair+table set. mod-IOU, or modified IOU, is IOU measured against both parts and part combinations in the ground truth; it is more tolerant of coarse segmentations, e.g., combining the back and seat of a chair. Higher IOU indicates better performance.

We first test unsupervised co-segmentation over 20 categories, 16 of which are from the ShapeNet part dataset [62]. These categories and the number of shapes are: planes (2,690), bags (76), caps (55), cars (898), chairs (3,758), earphones (69), guitars (787), knives (392), lamps (1,547), laptops (451), motors (202), mugs (184), pistols (283), rockets (66), skateboards (152), and tables (5,271). The 4 extra categories, benches (1,816), rifles (2,373), couches (3,173), and vessels (1,939), are from ShapeNet [5].

We train individual models for different shape categories. Some results are shown in Figures 1 and 6, with more in the supplemental material. Reasonable parts are obtained, and each branch of Bae-Net only outputs a specific part, giving us natural part correspondences. Our unsupervised segmentation is not guaranteed to produce the same part counts as those in the ground truth; it tends to produce coarser segmentation results, e.g., combining the seat and back of a chair. Since coarser segmentations are not necessarily wrong results, in Table 1, we report two sets of Intersection over Union (IOU) numbers which compare segmentation results by Bae-Net and the ground truth, one allowing part combinations and the other not.
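For reference, plain per-part IOU between predicted and ground-truth point labels can be computed as in the sketch below. This covers only the standard IOU, not the modified IOU that tolerates part combinations; the function names, the averaging over parts, and the convention of scoring a part absent from both prediction and ground truth as 1.0 are assumptions of this sketch.

```python
import numpy as np

def part_iou(pred, gt, part):
    """IOU of one part; pred and gt are integer label arrays of equal length."""
    p = pred == part
    g = gt == part
    union = np.logical_or(p, g).sum()
    if union == 0:
        return 1.0  # part absent from both: treated as perfect (assumed convention)
    return float(np.logical_and(p, g).sum() / union)

def mean_iou(pred, gt, parts):
    """Average IOU over the listed part ids for a single shape."""
    return float(np.mean([part_iou(pred, gt, part) for part in parts]))

pred = np.array([0, 0, 1, 1, 2])
gt = np.array([0, 1, 1, 1, 2])
score = mean_iou(pred, gt, parts=[0, 1, 2])
```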

Although unsupervised Bae-Net does not separate chair back and seat when training on the chair category, it can do so when tables are added for training. Also, it successfully corresponds chair seats with tabletops, chair legs with table legs; see Figure 6. This leads to a weakly supervised way of segmenting target parts, as we discuss next.

4.3 Comparison with Tags2Parts

Figure 7: Weakly supervised segmentation results on the Tags2Parts datasets [34]. The top row visualizes the implicit field of each branch by its 0.4-isosurface; different colors represent outputs from different branches. The visualization can be misleading as the field is not necessarily zero in empty areas. The middle row shows actual labelings assigned by the implicit fields: the target parts are in blue. The bottom row shows segmentation results of unsupervised training, i.e., we do not change the shape distribution of the given dataset by the per-shape labels. Some parts are not separated compared to weakly supervised results.

| Method | Arm | Back | Engine | Sail | Head |
|---|---|---|---|---|---|
| Tags2Parts [34] | 0.71 | 0.79 | 0.46 | 0.84 | 0.37 |
| Bae-Net | 0.94 | 0.85 | 0.88 | 0.92 | 0.76 |

Table 2: Comparison with Tags2Parts [34] on their datasets by AUC (higher number = better performance). Bae-Net outperforms [34] in every category, even though our network did not use the provided per-shape labels explicitly.

Now we compare our Bae-Net with a state-of-the-art weakly supervised part labeling network, Tags2Parts [34]. Given a shape dataset and a binary label for each shape indicating whether a target part appears in it or not, Tags2Parts can separate out the target parts, with the binary labels as weak supervision. Our network can do the same task with even weaker supervision. We do not pass the labels to the network or incorporate them into the loss function. Instead, we use the labels to change the training data distribution.

Our intuition is that, if two parts always appear together and combine in the same way, like chair back and seat, treating them as one single part is more efficient for the network to reconstruct them. But when we change the data distribution, say letting only 20% of the shapes have the target part (such as chair backs), it will be more natural for the network to separate the two parts and reconstruct the target part in a single branch. Therefore, we add weak supervision by simply making the number of shapes that do not have the target part four times as many as the shapes that have the target part, by duplicating the shapes in the dataset.
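The distribution change can be sketched as a simple duplication step. The helper below is hypothetical: it repeats whichever group is too small until shapes without the target part outnumber shapes with it by roughly four to one; how the original implementation picks which shapes to duplicate is not stated in the text.

```python
import math
from itertools import cycle, islice

def rebalance_by_duplication(shapes, has_part, ratio=4):
    """Return a training list with about `ratio` part-free shapes per
    part-bearing shape, obtained only by duplicating existing shapes."""
    pos = [s for s, h in zip(shapes, has_part) if h]
    neg = [s for s, h in zip(shapes, has_part) if not h]
    if not pos or not neg:
        raise ValueError("need shapes both with and without the target part")
    if len(neg) < ratio * len(pos):
        # Too few part-free shapes: cycle through them until there are enough.
        neg = list(islice(cycle(neg), ratio * len(pos)))
    else:
        # Too few part-bearing shapes: duplicate those instead.
        pos = list(islice(cycle(pos), math.ceil(len(neg) / ratio)))
    return pos + neg

balanced = rebalance_by_duplication(["a", "b", "c"], [True, True, False])
```

No label is passed to the network here; the per-shape tags only affect how often each shape is seen during training.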

We used the dataset provided by [34], which contains six part categories: (chair) armrest, (chair) back, (car) roof, (airplane) engine, (ship) sail, and (bed) head. We run our method on all categories except for car roof, since it is a flat surface part that our network cannot separate. We used the unsupervised version of our network to perform the task, first training for a few epochs using the distribution altered dataset for initialization, then only training our network on those shapes that have target parts to refine the results.

To compare results, we used the same metric as in [34]: Area Under the Curve (AUC) of precision/recall curves. For each test point, we get its probability of being in each part by normalizing the branch outputs with a unit sum. Quantitative and visual results are shown in Table 2 and Figure 7. Note that some parts, e.g., plane engines, that cannot be separated when training on the original dataset are segmented when training on the altered dataset. Our network particularly excels at segmenting planes, converging to eight effective branches representing body, left wing, right wing, engine and wheel on wings, jet engine and front wheel, vertical stabilizer, and two types of horizontal stabilizers.
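The unit-sum normalization used to turn branch outputs into per-part probabilities can be sketched as follows. The function name and the small `eps` guard against all-zero rows are assumptions of this sketch; the AUC computation itself is not shown.

```python
import numpy as np

def branch_probabilities(branches, eps=1e-12):
    """Normalize each row of (N, k) non-negative branch outputs to sum to 1."""
    totals = branches.sum(axis=1, keepdims=True)
    return branches / np.maximum(totals, eps)  # eps avoids division by zero

outputs = np.array([[0.1, 0.3],
                    [0.4, 0.4]])
probs = branch_probabilities(outputs)
```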

4.4 One-shot training vs supervised methods

Figure 8: One-shot segmentation results by Bae-Net, with one segmented exemplar (blue box). See examples of other categories and 2/3-shot training results in the supplementary material.

| Method | 1-exem. vs. 10% train set | 2-exem. vs. 20% train set | 3-exem. vs. 30% train set |
|---|---|---|---|
| PointNet [40] | 72.1 | 73.0 | 74.6 |
| PointNet++ [41] | 69.6 | 75.4 | 76.6 |
| PointCNN [29] | 58.0 | 65.6 | 65.7 |
| SSCN [13] | 56.7 | 61.0 | 64.6 |
| Our Bae-Net | 74.1 | 76.4 | 77.4 |

Table 3: Quantitative comparison to supervised methods by average IOU over 15 shape categories, without combining parts in the ground truth. Our one-shot learning with 1/2/3 exemplars outperforms supervised methods trained on 10%/20%/30% of the shapes respectively (on average each category has 765 training shapes).

| Method | plane | bag | cap | chair | earph. | guitar | knife | lamp | laptop | motor. | mug | pistol | rocket | skate. | table | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PointNet [40] | 76.1 | 69.8 | 62.6 | 86.0 | 62.1 | 86.2 | 79.7 | 73.6 | 93.3 | 59.1 | 83.4 | 75.9 | 41.8 | 57.7 | 74.8 | 72.1 |
| PointNet++ [41] | 76.4 | 43.4 | 77.8 | 87.5 | 67.7 | 87.4 | 18.2 | 71.4 | 94.1 | 61.3 | 90.4 | 72.8 | 51.4 | 68.7 | 75.3 | 69.6 |
| PointCNN [29] | 73.6 | 44.6 | 36.5 | 86.1 | 35.1 | 87.0 | 80.6 | 75.9 | 94.6 | 16.9 | 48.6 | 52.9 | 22.7 | 43.8 | 71.2 | 58.0 |
| SSCN [13] | 74.2 | 50.9 | 46.2 | 84.5 | 58.6 | 86.2 | 76.0 | 59.6 | 53.6 | 25.3 | 46.0 | 42.7 | 25.1 | 44.3 | 77.0 | 56.7 |
| Ours 1-exem. | 71.2 | 73.9 | 81.6 | 85.6 | 62.6 | 86.4 | 82.1 | 64.0 | 94.5 | 52.8 | 94.7 | 73.8 | 39.8 | 70.4 | 78.1 | 74.1 |
| Ours 2-exem. | 72.9 | 81.8 | 84.7 | 85.0 | 64.1 | 88.2 | 81.3 | 62.9 | 94.4 | 59.4 | 94.1 | 76.4 | 48.2 | 72.7 | 80.5 | 76.4 |
| Ours 3-exem. | 74.6 | 75.2 | 82.0 | 85.1 | 76.1 | 88.3 | 82.4 | 66.4 | 94.3 | 63.7 | 94.6 | 77.3 | 47.2 | 72.8 | 80.7 | 77.4 |
| 2-layer 1-exem. | 61.2 | 71.8 | 58.9 | 84.6 | 68.0 | 52.2 | 26.3 | 65.4 | 94.3 | 28.0 | 92.3 | 71.6 | 31.3 | 45.5 | 80.6 | 62.1 |
| 4-layer 1-exem. | 72.7 | 80.7 | 78.7 | 81.8 | 59.7 | 86.5 | 37.8 | 63.5 | 94.1 | 57.5 | 94.5 | 71.2 | 41.6 | 60.5 | 78.9 | 70.6 |

Table 4: Comparison results (average IOU) for each shape category: our Bae-Net with 1/2/3 exemplars vs. supervised methods using 10% of the training shapes. The last two rows show results of two variations of Bae-Net using 2 and 4 layers, respectively.

Finally, we use one or a few segmented exemplars to force Bae-Net to output designated parts, so that we can evaluate our results against ground-truth labels and compare with other methods. In detail, we manually select one, two, or three segmented exemplar shapes from the training set and train our model on the exemplars using the supervised loss, while training on the whole set using the unsupervised loss.
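
To make the interplay of the two losses concrete, here is a minimal sketch in plain Python. It rests on assumptions that this section does not state: the unsupervised loss is taken to be the mean squared error between the maximum over branch outputs and the inside/outside value of each point, and the supervised loss the mean squared error between each branch output and a per-part target. The function names and the toy data are hypothetical.

```python
def reconstruction_loss(branch_outputs, occupancy):
    """Assumed unsupervised loss: MSE between max over branches and occupancy."""
    total = 0.0
    for values, inside in zip(branch_outputs, occupancy):
        total += (max(values) - inside) ** 2
    return total / len(occupancy)


def label_loss(branch_outputs, part_labels):
    """Assumed supervised loss: each branch should output 1 only on its own part.

    A label of -1 marks a point outside the shape (all branch targets are 0).
    """
    total = 0.0
    for values, label in zip(branch_outputs, part_labels):
        for j, value in enumerate(values):
            target = 1.0 if j == label else 0.0
            total += (value - target) ** 2
    return total / len(part_labels)


# Toy exemplar: three sample points, two branches.
occupancy = [1.0, 1.0, 0.0]   # inside, inside, outside
part_labels = [0, 1, -1]      # part 0, part 1, outside

separated = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # each branch owns one part
merged = [[1.0, 0.0], [1.0, 0.0], [0.0, 0.0]]     # branch 0 absorbs both parts

print(reconstruction_loss(separated, occupancy), label_loss(separated, part_labels))
print(reconstruction_loss(merged, occupancy), label_loss(merged, part_labels))
```

In this toy case both outputs reconstruct the shape perfectly, so the reconstruction loss cannot tell them apart; only the label loss on the exemplar penalizes the merged output, which is how a few exemplars can pin designated parts to designated branches.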

We did not find other semi-supervised segmentation methods that take only a few exemplars and segment all shapes in a set. We therefore compare our work with several state-of-the-art supervised methods, namely PointNet [40], PointNet++ [41], PointCNN [29] and SSCN [13]. Since it would be unfair to give the supervised methods only 1-3 exemplars to train on, as we did for Bae-Net, we train their models using 10%, 20%, or 30% of the shapes from the datasets (765 shapes per category on average) for comparison. We evaluate all methods on the ShapeNet part dataset [62] by average IOU (no part combinations are tolerated), and train individual models for different shape categories. We did not include the car category for the same reason as in Section 4.3.
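
As an illustration of the metric, the following plain-Python sketch computes a per-shape IOU as the mean over ground-truth part labels, then averages over shapes. The function names are hypothetical, and the handling of a part that is absent from both prediction and ground truth (scored as 1 here) is an assumed convention, not one stated in this section.

```python
def part_iou(pred, gt, part):
    """IOU of one part label over the sampled points of a single shape."""
    inter = sum(1 for p, g in zip(pred, gt) if p == part and g == part)
    union = sum(1 for p, g in zip(pred, gt) if p == part or g == part)
    # Assumed convention: a part missing from both counts as a perfect match.
    return 1.0 if union == 0 else inter / union


def shape_iou(pred, gt, num_parts):
    """Mean IOU over all ground-truth part labels, with no parts combined."""
    return sum(part_iou(pred, gt, j) for j in range(num_parts)) / num_parts


def category_iou(shapes, num_parts):
    """Average of per-shape IOUs over a list of (pred, gt) label pairs."""
    return sum(shape_iou(p, g, num_parts) for p, g in shapes) / len(shapes)


# One toy shape with four points and two parts.
pred = [0, 0, 1, 1]
gt = [0, 1, 1, 1]
print(shape_iou(pred, gt, num_parts=2))  # (1/2 + 2/3) / 2
```

Because every ground-truth label is scored separately, a prediction that merges two parts into one branch is penalized on both labels.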

Ideally, we hope to select exemplars that contain all the ground truth parts and are representative of the shape collection, but this can be difficult. For example, there are two kinds of lamps – ground lamps and ceiling lamps – that can both be segmented into three parts. However, the order and labels of the lamp parts are different, e.g., the base of a ground lamp and the base of a ceiling lamp have different labels. Our network cannot infer such semantic information from 1-3 exemplars, so we selected only one type of lamp (ground lamps) as exemplars, and our network has only three branches (without one for the base of the ceiling lamp). During evaluation, we add a fake branch that only outputs zero, to represent the missing branch.
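
The fake branch can be pictured as a constant-zero entry inserted at the position of the missing ground-truth label before per-point labels are read off. The sketch below assumes that a point is labeled by the branch with the largest output; the function name and the index of the missing label are hypothetical.

```python
def assign_labels(branch_outputs, missing_label):
    """Pad each point's branch outputs with a zero at `missing_label`,
    then label the point by the index of the largest value."""
    labels = []
    for values in branch_outputs:
        padded = list(values)
        padded.insert(missing_label, 0.0)  # fake branch: always outputs zero
        labels.append(max(range(len(padded)), key=lambda j: padded[j]))
    return labels


# Three trained branches; the ground truth has a fourth label (index 3)
# that no exemplar contains.
outputs = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.8],
]
print(assign_labels(outputs, missing_label=3))  # [0, 1, 2]
```

With this padding the label indices line up with the ground truth, and the missing label is never predicted for a point on which some trained branch outputs a positive value.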

The qualitative result for the chair category is shown in Figure 8. Quantitative results averaged over all categories are shown in Table 3, and per-category results in Table 4. Our network trained with 1/2/3 exemplars outperforms supervised methods trained on 10%/20%/30% of the training set, respectively. In some categories, our model trained on one exemplar already achieves high accuracy. In a few categories, adding more exemplars reduces accuracy, which may be due to ambiguity introduced by the extra exemplars. We also show an ablation study on the number of network layers, similar to the one in Section 4.1. We use { 1024- } and { 1024-256-256- } for the 2-layer and 4-layer models, respectively, and train them with one exemplar in the same setting as our default 3-layer model. The 2-layer model performs poorly on several categories since it lacks the representation ability to reconstruct complex shapes, e.g., shapes with round parts (cap, guitar, motorbike). The 4-layer model performs similarly to the 3-layer model overall, but worse on some categories because it often merges a small part into a nearby larger part. These variations resemble the trends in Figure 4.

5 Conclusion, limitations, and future work

We have developed Bae-Net, a branched autoencoder, for unsupervised, one-shot, and weakly supervised shape co-segmentation. Experiments show that our network can outperform state-of-the-art supervised methods, including PointNet++, PointCNN, etc., using much less training data (1-3 exemplars vs. 77-230 for the supervised methods, average over 15 shape categories). On the other hand, compared to the supervised methods, Bae-Net tends to produce coarser segmentations, which are not necessarily incorrect and can provide a good starting point for further refinement.
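
The 77-230 range follows from the 10% and 30% splits of the average category size of 765 shapes. A small check in Python, assuming fractional counts are rounded up (the helper name is hypothetical):

```python
def training_shapes(total, percent):
    """Shapes in `percent`% of `total`, rounded up using integer arithmetic."""
    return -(-total * percent // 100)


counts = {p: training_shapes(765, p) for p in (10, 20, 30)}
print(counts)  # {10: 77, 20: 153, 30: 230}
```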

Many prior unsupervised co-segmentation methods [12, 18, 20, 50], which are model-driven rather than data-driven, had only been tested on very small input sets (fewer than 50 shapes). In contrast, Bae-Net can process much larger collections (up to 5,000+ shapes). In addition, unless otherwise noted, all the results shown in the paper were obtained using the default network settings, further validating the generality and robustness of our co-segmentation network.

Bae-Net is able to produce consistent segmentations over large shape collections without explicitly enforcing a consistency loss – the consistency is a consequence of the network architecture. That said, our current method does not provide any theoretical guarantee for segmentation consistency or universal part counts, and such extensions could build upon the results of Bae-Net.

For unsupervised segmentation, since we initialize the network parameters randomly and optimize a reconstruction loss, while treating each branch equally, there is no way to predict which branch will eventually output which part. The network may also be sensitive to the initial parameters, where different runs may result in different segmentation results, e.g., combining seat and back vs seat and legs for the chair category. Note however that both results may be acceptable as a coarse structure interpretation for chairs.

Another drawback is that our network groups similar and close-by parts in different shapes for correspondence. This is reasonable in most cases, but for some categories, e.g., lamps or tables, where similar and close-by parts may be assigned different labels, our network can be confused. How to incorporate shape semantics into Bae-Net is worth investigating. Finally, Bae-Net is much shallower and thinner than the network in [6], since we care more about segmentation (not reconstruction) quality. However, the limited depth and width of the network make it difficult to train on high-resolution models (say ), which hinders us from obtaining fine-grained segmentations.

In future work, besides addressing the issues above, we plan to introduce hierarchies into the shape representation and network structure, since it is more natural to segment shapes in a coarse-to-fine manner. Also, Bae-Net provides basic part separation and correspondence, which could be incorporated when developing generative models.

References

  • [1] J. Ahn and S. Kwak. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In CVPR, 2018.
  • [2] D. Anguelov, D. Koller, H. Pang, P. Srinivasan, and S. Thrun. Recovering articulated object models from 3D range data. In UAI, 2004.
  • [3] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI, 2017.
  • [4] T. Berg, A. Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Learned-Miller, and D. Forsyth. Names and faces in the news. In CVPR, 2004.
  • [5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. CoRR, abs/1512.03012, 2015.
  • [6] Z. Chen and H. Zhang. Learning implicit fields for generative shape modeling. In CVPR, 2019.
  • [7] R. G. Cinbis, J. J. Verbeek, and C. Schmid. Weakly supervised object localization with multi-fold multiple instance learning. TPAMI, 39(1), 2017.
  • [8] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.
  • [9] T. Durand, T. Mordan, N. Thome, and M. Cord. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In CVPR, 2017.
  • [10] M. Everingham, J. Sivic, and A. Zisserman. Taking the bite out of automatic naming of characters in TV video. Image and Vision Computing, 2009.
  • [11] W. Ge, S. Yang, and Y. Yu. Multi-evidence filtering and fusion for multi-label classification and object detection and semantic segmentation based on weakly supervised learning. In CVPR, 2018.
  • [12] A. Golovinskiy and T. Funkhouser. Consistent segmentation of 3D models. Computers & Graphics, 33(3), 2009.
  • [13] B. Graham, M. Engelcke, and L. van der Maaten. 3D semantic segmentation with submanifold sparse convolutional networks. In CVPR, 2018.
  • [14] D. O. Hebb. The Organization of Behavior. 1949.
  • [15] D. D. Hoffman and W. A. Richards. Parts of recognition. Cognition, 1984.
  • [16] K.-J. Hsu, Y.-Y. Lin, and Y.-Y. Chuang. Co-attention CNNs for unsupervised object co-segmentation. In Proc. IJCAI, 2018.
  • [17] R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick. Learning to segment every thing. In CVPR, 2018.
  • [18] R. Hu, L. Fan, and L. Liu. Co-segmentation of 3D shapes via subspace clustering. Computer Graphics Forum, 31(5), 2012.
  • [19] H. Huang, E. Kalogerakis, and B. Marlin. Analysis and synthesis of 3D shape families via deep-learned generative models of surfaces. Computer Graphics Forum, 34(5), 2015.
  • [20] Q. Huang, V. Koltun, and L. Guibas. Joint shape segmentation with linear programming. Trans. Graph. (SIGGRAPH Asia), 30(6), 2011.
  • [21] Q. Huang, W. Wang, and U. Neumann. Recurrent slice networks for 3D segmentation of point clouds. In CVPR, 2018.
  • [22] Z. Huang, X. Wang, J. W. W. Liu, and J. Wang. Weakly-supervised semantic segmentation network with deep seeded region growing. In CVPR, 2018.
  • [23] A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image co-segmentation. In CVPR, 2010.
  • [24] A. Joulin, F. Bach, and J. Ponce. Multi-class cosegmentation. In CVPR, 2012.
  • [25] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri. 3D shape segmentation with projective convolutional networks. In CVPR, volume 1, 2017.
  • [26] A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, 2017.
  • [27] V. G. Kim, W. Li, N. J. Mitra, S. Chaudhuri, S. DiVerdi, and T. Funkhouser. Learning part-based templates from large collections of 3D shapes. Trans. Graph. (SIGGRAPH), 32(4), 2013.
  • [28] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In ECCV, 2016.
  • [29] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen. PointCNN: Convolution on X-transformed points. In NeurIPS, 2018.
  • [30] B.-C. Lin, D.-J. Chen, and L.-W. Chang. Unsupervised image co-segmentation based on cooperative game. In Proc. ACCV, 2014.
  • [31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [32] T. Ma and L. J. Latecki. Graph transduction learning with connectivity constraints with application to multiple foreground cosegmentation. In CVPR, 2013.
  • [33] M. Meng, J. Xia, J. Luo, and Y. He. Unsupervised co-segmentation for 3D shapes using iterative multi-label optimization. CAD, 45(2), 2013.
  • [34] S. Muralikrishnan, V. G. Kim, and S. Chaudhuri. Tags2Parts: Discovering semantic regions from shape tags. In CVPR, 2018.
  • [35] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? – weakly-supervised learning with convolutional neural networks. In CVPR, 2015.
  • [36] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In ICCV, 2015.
  • [37] D. Pathak, P. Krähenbühl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, 2015.
  • [38] D. Pathak, E. Shelhamer, J. Long, and T. Darrell. Fully convolutional multi-class multiple instance learning. In ICLR Workshop, 2015.
  • [39] P. O. Pinheiro and R. Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015.
  • [40] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
  • [41] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.
  • [42] M. Rajchl, M. C. H. Lee, O. Oktay, K. Kamnitsas, J. Passerat-Palmbach, W. Bai, M. Damodaram, M. A. Rutherford, J. V. Hajnal, B. Kainz, and D. Rueckert. DeepCut: Object segmentation from bounding box annotations using convolutional neural networks. Trans. Med. Imag., 36, 2017.
  • [43] C. Rother, V. Kolmogorov, T. Minka, and A. Blake. Cosegmentation of image pairs by histogram matching – incorporating a global constraint into MRFs. In CVPR, 2006.
  • [44] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu. Unsupervised joint object discovery and segmentation in internet images. In CVPR, 2013.
  • [45] J. C. Rubio, J. Serrat, A. Lopez, and N. Paragios. Unsupervised co-segmentation through region matching. In CVPR, 2012.
  • [46] F. Saleh, M. S. A. Akbarian, M. Salzmann, L. Petersson, S. Gould, and J. M. Alvarez. Built-in foreground/background prior for weakly-supervised semantic segmentation. In ECCV, 2016.
  • [47] A. Shamir. A survey on mesh segmentation techniques. Computer Graphics Forum, 27(6), 2008.
  • [48] P. Shilane and T. Funkhouser. Distinctive regions of 3D surfaces. Trans. Graph., 26(2), 2007.
  • [49] Z. Shu, C. Qi, S. Xin, C. Hu, L. Wang, Y. Zhang, and L. Liu. Unsupervised 3D shape segmentation and co-segmentation via deep learning. CAGD, 43, 2016.
  • [50] O. Sidi, O. van Kaick, Y. Kleiman, H. Zhang, and D. Cohen-Or. Unsupervised co-segmentation of a set of shapes via descriptor-space spectral clustering. Trans. Graph. (SIGGRAPH Asia), 30(6), 2011.
  • [51] H. O. Song, R. B. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects with minimal supervision. In ICML, 2014.
  • [52] S. Tulsiani, H. Su, L. J. Guibas, A. A. Efros, and J. Malik. Learning shape abstractions by assembling volumetric primitives. In CVPR, 2017.
  • [53] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. In CVPR, 2018.
  • [54] C. Wang, W. Ren, K. Huang, and T. Tan. Weakly supervised object localization with latent category learning. In ECCV, 2014.
  • [55] C. Wang, B. Yang, and Y. Liao. Unsupervised image segmentation using convolutional autoencoder with total variation regularization as preprocessing. In Proc. ICASSP, 2017.
  • [56] X. Wang, S. You, X. Li, and H. Ma. Weakly-supervised semantic segmentation by iteratively mining common object features. In CVPR, 2018.
  • [57] Y. Wang, S. Asafi, O. Van Kaick, H. Zhang, D. Cohen-Or, and B. Chen. Active co-analysis of a set of shapes. Trans. Graph. (SIGGRAPH Asia), 31(6), 2012.
  • [58] Z. Wang and R. Liu. Semi-supervised learning for large scale image cosegmentation. In ICCV, 2013.
  • [59] X. Xia and B. Kulis. W-Net: A deep model for fully unsupervised image segmentation. CoRR, abs/1711.08506, 2017.
  • [60] K. Xu, H. Li, H. Zhang, D. Cohen-Or, Y. Xiong, and Z.-Q. Cheng. Style-content separation by anisotropic part scales. Trans. Graph. (SIGGRAPH Asia), 29(6), 2010.
  • [61] L. Yi, H. Huang, D. Liu, E. Kalogerakis, H. Su, and L. Guibas. Deep part induction from articulated object pairs. Trans. Graph. (SIGGRAPH Asia), 37(6), 2018.
  • [62] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, L. Guibas, et al. A scalable active framework for region annotation in 3D shape collections. Trans. Graph. (SIGGRAPH Asia), 35(6), 2016.
  • [63] L. Yi, H. Su, X. Guo, and L. J. Guibas. SyncSpecCNN: Synchronized spectral CNN for 3D shape segmentation. In CVPR, 2017.
  • [64] J. Yu, D. Huang, and Z. Wei. Unsupervised image segmentation via stacked denoising auto-encoder and hierarchical patch indexing. Signal Processing, 143, 2018.
  • [65] X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. Huang. Adversarial complementary learning for weakly supervised object localization. In CVPR, 2018.
  • [66] X. Zhao, S. Liang, and Y. Wei. Pseudo mask augmented object detection. In CVPR, 2018.