1 Introduction
Cosegmentation takes a collection of data sharing some common characteristic and produces a consistent segmentation of each data item. Specific to shape cosegmentation, the common characteristic of the input collection is typically tied to the common category that the shapes belong to, e.g., they are all lamps or chairs. The significance of the problem is attributed to the consistency requirement, since a shape cosegmentation not only reveals the structure of each shape but also a structural correspondence across the entire shape set, enabling a variety of applications including attribute transfer and mix-n-match modeling.
In recent years, many deep neural networks have been developed for segmentation [3, 21, 25, 31, 41, 63]. Most methods to date formulate segmentation as a supervised classification problem and are applicable to segmenting a single input. Representative approaches include SegNet [3] for images and PointNet [40] for shapes, where the networks are trained by supervision with ground-truth segmentations to map pixel or point features to segment labels.
Cosegmentation seeks a structural understanding or explanation of an entire set. If one were to abide by Occam’s razor, then the best explanation would be the simplest one. This motivates us to treat cosegmentation as a representation learning problem, with the added bonus that such learning may be unsupervised, without any ground-truth labels. Given the strong belief that object recognition by humans is part-based [14, 15], the simplest explanation for a collection of shapes belonging to the same category would be a combination of universal parts for that category, e.g., chair backs or airplane wings. Hence, an unsupervised shape cosegmentation would amount to finding the simplest part representations for a shape collection. Our choice for the representation learning module is a variant of the autoencoder.
In principle, autoencoders learn compact representations of a set of data via dimensionality reduction while minimizing a self-reconstruction loss. To learn shape parts, we introduce a branched version of autoencoders, where each branch is tasked to learn a simple representation for one universal part of the input shape collection. In the absence of any ground-truth segmentation labels, our branched autoencoder, or BaeNet for short, is trained to minimize a shape (rather than label) reconstruction loss, where shapes are represented using implicit fields [6]. Specifically, the BaeNet decoder takes as input a concatenation between a point coordinate and an encoding of the input shape (from the BaeNet encoder) and outputs a value which indicates whether the point is inside or outside the shape.
The BaeNet architecture is shown in Figure 2, where the encoder employs a traditional convolutional neural network (CNN). The encoder-decoder combination of BaeNet is trained with all shapes in the input collection using a (shape) reconstruction loss. Appending the point coordinate to the decoder input is critical since it adds spatial awareness to the reconstruction process, which is often lost in the convolutional features from the CNN encoder. Each neuron in the third layer (L3) is trained to learn the inside/outside status of the point relative to one shape part. The parts are learned in a joint space of shape features and point locations. In Section 4, we will show that the limited neurons in our autoencoder for representation learning, and the linearities modeled by the neurons, allow BaeNet to learn compact part representations in the decoder branches in L3. Figure 3 shows a toy example to illustrate the joint space and contrasts the part representations that are learned by our BaeNet versus a classic CNN autoencoder.
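To make the decoder concrete, here is a minimal pure-Python sketch of a branched implicit decoder in the spirit described above. Layer sizes, the activation slope, and the random weights are illustrative stand-ins, not the trained network; the point is only the data flow: concatenate point and code, pass through three fully connected layers, read one inside/outside value per branch, and max-pool the branches into the whole-shape value.

```python
import math
import random

def leaky_relu(x, slope=0.02):
    # Leaky ReLU activation used between the decoder layers.
    return x if x > 0.0 else slope * x

def linear(vec, weights, biases):
    # Fully connected layer: `weights` holds one row per output neuron.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, biases)]

def bae_decode(point, code, params):
    """Branched implicit decoder sketch (three fully connected layers).

    `point` is a coordinate (e.g. [x, y, z]) and `code` the shape feature
    vector from the encoder; the two are concatenated. Each L3 output
    ("branch") estimates the inside/outside status of `point` with respect
    to one part; max-pooling merges parts into the whole shape.
    """
    x = list(point) + list(code)
    x = [leaky_relu(v) for v in linear(x, *params["L1"])]
    x = [leaky_relu(v) for v in linear(x, *params["L2"])]
    # Sigmoid on each branch so the outputs live in (0, 1).
    branches = [1.0 / (1.0 + math.exp(-v)) for v in linear(x, *params["L3"])]
    return branches, max(branches)   # per-part values, merged shape value

def random_layer(n_in, n_out, rng):
    # Random stand-in weights; a real network would learn these.
    w = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

rng = random.Random(0)
dims = [3 + 4, 16, 8, 4]           # (point + code) -> L1 -> L2 -> 4 branches
params = {name: random_layer(dims[i], dims[i + 1], rng)
          for i, name in enumerate(["L1", "L2", "L3"])}
branches, inside = bae_decode([0.1, 0.2, 0.3], [0.5, -0.2, 0.1, 0.9], params)
```

Because the merge is a max over branch outputs, parts are free to overlap, which matches the description of the network in Section 3.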
BaeNet has a simple architecture and as such, it can be easily adapted to perform one-shot learning, where only one or few exemplar segmentations are provided. In this case, the number of branches is set according to the exemplar(s) and the shape reconstruction loss is complemented by a label reconstruction loss; see Section 3.1.
We demonstrate unsupervised, weakly supervised, and one-shot learning results for shape cosegmentation on the ShapeNet [5], ShapeNet part [62], and Tags2Parts [34] datasets, comparing BaeNet with existing supervised and weakly supervised methods. The cosegmentation results from BaeNet are consistent, without explicitly enforcing any consistency loss in the network. Using only one (resp. two or three) segmented exemplars, the one-shot learning version of BaeNet outperforms state-of-the-art supervised segmentation methods, including PointNet++ and PointCNN, when the latter are trained on 10% (resp. 20% or 30%), i.e., hundreds, of the segmented shapes.
2 Related work
Our work is most closely related to prior research on unsupervised, weakly supervised, and semi-supervised cosegmentation of 3D shapes. Many of these methods, especially those relying on statistical learning, are inspired by related research on 2D image segmentation.
Image cosegmentation without strong supervision. One may view unsupervised image cosegmentation as pixel-level clustering, guided by color similarity within single images as well as consistency across image collections. Existing approaches employ graph cuts [43], discriminative clustering [23, 24], correspondences [44, 45], cooperative games [30], and deep neural networks [16, 55, 59, 64].
An alternative research theme utilizes some weak cues, such as image-level labels. In this weakly supervised setup, the goal is to find those image regions that strongly correlate with each label, using either traditional statistical models [7, 51, 54] or newer ones based on deep networks [1, 9, 11, 22, 28, 35, 37, 38, 39, 46, 56, 65]. Other forms of weak supervision include bounding boxes [8, 17, 26, 36, 42, 66] and textual captions [4, 10]. Semi-supervised methods assume that full supervision is available for a few images, while the rest are unsupervised: such approaches have been applied to image cosegmentation, e.g. [32, 58].
In contrast to all the above methods, we develop an unsupervised cosegmentation approach for geometric shapes, without color data. Thus, we concentrate on efficient modeling of spatial variance. We employ a novel encode-and-reconstruct scheme, where each branch of a deep network learns to localize instances of a part across multiple examples in order to compactly represent them, and reassemble them into the original shape. Our method easily adapts to weakly and semi-supervised scenarios.
Our method is critically dependent on the choice of network architecture. The relatively shallow fully-connected stack is high-capacity enough to model nontrivial parts, but shallow enough that the neurons are forced to learn a compact, efficient representation of the shape space, in terms of recurring parts carved out by successive simple units. Thus, the geometric prior is inherent in the architecture itself, similar in spirit to deep image priors [53].
3D segmentation without strong supervision. Building on substantial prior work on single-shape mesh segmentation based on geometric cues [47], the pioneering work of Golovinskiy and Funkhouser [12] explored consistent cosegmentation of 3D shapes by constructing a graph connecting not just adjacent polygonal faces in the same mesh, but also corresponding faces across different meshes. A normalized cut of this graph yields a joint segmentation. Subsequently, several papers developed alternative unsupervised clustering strategies for mesh faces, given a handcrafted similarity metric induced by a feature embedding or a graph [18, 20, 33, 50, 60]. Shu et al. [49] modified this setup by first transforming handcrafted local features with a stacked autoencoder before applying an (independently learned) Gaussian mixture model and per-shape graph cuts. In contrast, our method is an end-to-end differentiable pipeline that requires no manual feature specification or large-scale graph optimization.
Tulsiani et al. [52] proposed an unsupervised method to approximate a 3D shape with a small set of primitive parts (cuboids), inducing a segmentation of the underlying mesh. Their approach has similarities to ours: they predict part cuboids with branches from an encoder network, and impose a reconstruction loss to make the cuboid assembly resemble the original shape. However, the critical difference is the restriction to cuboidal boxes: they cannot accommodate complex, non-convex part geometries such as the rings in Figure 4 and the groups of disjoint lamp parts in Figure 1, for which a nonlinear stack of neurons is a much more effective indicator function.
In parallel, approaches were developed for weakly supervised [34, 48] and semi-supervised [19, 27, 57] shape segmentation. Shapes formed by articulation of a template shape can be jointly cosegmented [2, 61]. Our method does not depend on any such supervision or base template, although it can optionally benefit from one or two annotated examples to separate strongly correlated part pairs.
3 BaeNet: architecture, loss, and training
The architecture of BaeNet draws inspiration from the recently introduced implicit decoder of Chen and Zhang [6]. Their network learns an implicit field by means of a binary classifier, which is similar to BaeNet. The main difference, as shown in Figure 2, is that BaeNet is designed to segment a shape into different parts by reconstructing the parts in different branches of the network. Similar to [6], we use a traditional convolutional neural network as the encoder to produce the feature code for a given shape. We also adopt a three-layer fully connected neural network as our decoder. The decoder takes a joint vector of point coordinates and feature code as input, and outputs a value in each output branch that indicates whether the input point is inside a part (1) or not (0). Finally, we use a max pooling operator to merge parts together and obtain the entire shape, which allows our segmented parts to overlap. We use “L1”, “L2” and “L3” to denote the first, second, and third layer, respectively. The different network design choices are discussed in Section 4.

3.1 Network losses for various learning scenarios
Unsupervised.
For each input point coordinate, our network outputs a value that indicates the likelihood that the given point is inside the shape. We train our network with points sampled in the 3D space surrounding the input shape, together with the inside/outside status of those points. After sampling points for the input shapes using the same method as [6], we can train our autoencoder with a mean square loss:
L_{\mathrm{unsup}}(S) = \mathbb{E}_{s \sim S}\, \mathbb{E}_{p \sim P(s)} \big[ (f(p) - F(p))^2 \big] \qquad (1)

where S is the distribution of training shapes, P(s) is the distribution of sampled points given shape s, f(p) is the output value of our decoder for input point p, and F(p) is the ground-truth inside/outside status of point p. This loss function allows us to reconstruct the target shape in the output layer. The segmented parts will be expressed in the branches of L3, since the final output is taken as the maximum value over the fields represented by those branches.
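In miniature, and with illustrative names only (the real training operates on mini-batches with a stochastic optimizer), the expectation in Eq. (1) becomes an average over sampled points:

```python
def unsup_loss(decoder, shapes):
    """Mean-square reconstruction loss of Eq. (1), in miniature.

    `shapes` is a list of (code, samples) pairs, where `samples` holds
    (point, inside) tuples drawn around the shape; `decoder(point, code)`
    stands in for the merged (max-pooled) network output f(p), and
    `inside` is the ground-truth status F(p). Names are illustrative.
    """
    total, count = 0.0, 0
    for code, samples in shapes:
        for point, inside in samples:
            total += (decoder(point, code) - inside) ** 2
            count += 1
    return total / count
```

For example, a decoder that already agrees with the ground truth on every sampled point yields a loss of zero.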
Supervised.
If we have examples with ground-truth part labels, we can also train BaeNet in a supervised way. Denote by F_i(p) the ground-truth status of point p for part i, and by f_i(p) the output value of the i-th branch in L3. For a network with k branches in L3, the supervised loss is:

L_{\mathrm{sup}}(S) = \mathbb{E}_{s \sim S}\, \mathbb{E}_{p \sim P(s)} \Big[ \tfrac{1}{k} \sum_{i=1}^{k} (f_i(p) - F_i(p))^2 \Big] \qquad (2)
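Per sampled point, the branch-wise term inside Eq. (2) can be sketched as follows (averaging over points and shapes is left to the caller; names are our own):

```python
def sup_loss_point(branch_outputs, branch_labels):
    """Per-branch mean-square term of Eq. (2) for a single sampled point.

    `branch_outputs` are the k values f_i(p) from the L3 branches and
    `branch_labels` the ground-truth statuses F_i(p). Illustrative sketch,
    not the authors' implementation.
    """
    k = len(branch_outputs)
    return sum((f - F) ** 2 for f, F in zip(branch_outputs, branch_labels)) / k
```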
In datasets such as the ShapeNet part dataset [62], shapes are represented by point clouds sampled from their surfaces. In such datasets, the inside/outside status of a point can be ambiguous. However, since our sampled points are from voxel grids and the reconstructed shapes are thicker than the originals, we can assume all points in the ShapeNet part dataset are inside our reconstructed shapes. We use both our sampled points from voxel grids and the point clouds in the ShapeNet part dataset, by modifying the loss function:
L_{\mathrm{sup}}(S) = \mathbb{E}_{s \sim S} \Big[ \mathbb{E}_{p \sim P_1(s)} (f(p) - F(p))^2 + \gamma\, \mathbb{E}_{p \sim P_2(s)} \tfrac{1}{k} \sum_{i=1}^{k} (f_i(p) - F_i(p))^2 \Big] \qquad (3)

where P_1(s) is the distribution of our sampled points from voxel grids, and P_2(s) is the distribution of points in the ShapeNet part dataset. We set \gamma to 1 in our experiments.
One-shot learning.
Our network also supports one-shot training, where we feed the network one (or two, three, ...) shapes with ground-truth part labels, together with other shapes that have no part labels. To enable one-shot training, we define a joint loss:

L_{\mathrm{joint}} = L_{\mathrm{unsup}}(S_0) + \alpha\, L_{\mathrm{sup}}(S_1) \qquad (4)

where S_0 is the distribution of all shapes, and S_1 is the distribution of the few given shapes with part labels. We do not explicitly use this loss function or set \alpha. Instead, we train the network using the unsupervised and supervised losses alternately, performing one supervised training iteration after every 4 unsupervised training iterations.
Additionally, we add a very small regularization term for the parameters of L3, so as to prevent unnecessary overlap, e.g., when the part output by one branch contains the part output by another branch.
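The alternation schedule just described can be sketched as a sequence of step types (the function name and interface are our own; the small L3 weight penalty would simply be added to whichever loss is active on a given step):

```python
def training_schedule(n_iters, sup_every=5):
    """Return step types for one-shot training: rather than optimizing the
    joint loss of Eq. (4) directly, one supervised iteration follows every
    4 unsupervised iterations, i.e. every 5th step overall. Illustrative
    sketch only."""
    return ["sup" if i % sup_every == 0 else "unsup"
            for i in range(1, n_iters + 1)]
```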
3.2 Point label assignment
After training, we get an approximate implicit field for the input shape. To label a given point of an input shape, we simply feed the point into the network together with the code encoding the feature of the input shape, and label the point by looking at which branch in L3 gives the highest value. If the training has exemplar shapes as guidance, each branch will be assigned a label automatically with respect to the exemplars. If the training is unsupervised, we need to look at the branch outputs and give a label for each branch by hand. For example in Figure 2, we can label branch #3 as “jet engine”, and each point having the highest value in this branch will be labeled as “jet engine”. To label a mesh model, we first subdivide the surfaces to obtain finegrained triangles, and assign a label for each triangle. To label a triangle, we feed its three vertices into the network and sum their output values in each branch, and assign the label whose associated branch gives the highest value.
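The assignment rules above can be sketched as follows (names are illustrative; the branch values would come from the trained L3 outputs of the decoder):

```python
def label_point(branch_values, branch_names):
    """Label a point with the name of the branch giving the highest value."""
    best = max(range(len(branch_values)), key=branch_values.__getitem__)
    return branch_names[best]

def label_triangle(vertex_branch_values, branch_names):
    """Label a triangle by summing its three vertices' per-branch outputs
    and taking the branch with the largest total, as described above.
    `vertex_branch_values` is a list of three per-branch output lists."""
    sums = [sum(v[i] for v in vertex_branch_values)
            for i in range(len(branch_names))]
    return label_point(sums, branch_names)
```

For example, a point whose second branch dominates would receive that branch's (manually or exemplar-assigned) label.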
3.3 Training details
In the following, we denote the decoder structure by the width (number of neurons) of each layer as {L1-L2-L3}. The encoders for all tasks are standard CNN encoders.
In the 2D shape extraction experiment, we set the feature vector to be 16-D, since the goal of this experiment is to explain why and how the network works, and the shapes are easy to represent. We used the same width for all hidden layers since this makes it easier to compare models with different depths. The decoder is {256-4} for the 2-layer model, {256-256-4} for the 3-layer model, and {256-256-256-4} for the 4-layer model; for higher-resolution images we use a larger width instead of 256.
In all other experiments, our encoder takes a voxelized shape as input and produces a feature code. We sample points from each shape’s voxel model, and use these point-value pairs to compute the unsupervised loss.
For unsupervised tasks, we set the decoder network to {3072-384-12} and train 200,000 iterations with mini-batches of size 1, which takes 2 hours per category. For one-shot tasks, we use a {1024-256-k} decoder, where k is the number of ground-truth parts in the exemplar shapes. The decoder is lighter, hence training finishes in a much shorter time. For each category, we train 100,000 iterations: on all 15 categories this takes less than 10 hours total on a single NVIDIA GTX 1080Ti GPU. We also find that doing 3,000-4,000 iterations of supervised training before alternating it with unsupervised training improves the results. We plan to release all code for this project.
4 Experiments and results
In this section, we show qualitative and quantitative segmentation results in various settings with our BaeNet and compare them to those from state-of-the-art methods. But first, in Section 4.1, we discuss different architecture design choices and offer insights into how our network works.
4.1 Network design choices and insights
We first explain our network design choices in detail, as illuminated by two synthetic 2D datasets, “elements” and “triple rings”. “Elements” is synthesized by putting three different shape patterns over images, where the cross is placed randomly on the image, the triangle is placed randomly on a horizontal line in the middle of the image, and the rhombus is placed randomly on a vertical line in the middle of the image. “Triple rings” is synthesized by placing three rings of different sizes randomly over images. See Figure 4 for some examples.
First, we train BaeNet with 4 branches on the two datasets; see some results in Figure 4. Our network successfully separated the shape patterns, even when two patterns overlap. Further, each of the output branches only outputs one specific shape pattern, thus we also obtain a shape correspondence from the cosegmentation process.
We visualize the neuron activations in Figure 5. In L1, the point coordinates and the shape feature code have gone through a linear transform and a leaky ReLU activation, therefore the activation maps in L1 are linear “space dividers” with gradients. In L2, each neuron linearly combines the fields in L1 to form basic shapes. The combined shapes are mostly convex: although non-convex shapes can be formed, they need more resources (L1 neurons) than simpler shapes. This is because L2 neurons calculate a weighted sum of the values in L1 neurons, not MIN, MAX, or a logical operation, thus each L1 neuron brings a global, rather than local, change in L2. L2 represents higher-level shapes than L1, therefore we can expect L1 to have many more neurons than L2; we incorporate this idea in our network design for shape segmentation to significantly shorten training time. The L3 neurons further combine the shapes in L2 to form output parts, and our final output combines all L3 outputs via max pooling.
After understanding how the network works, we will be able to explain why our network tends to output segmented, corresponding parts in each branch. For a single shape, the network has limited representation ability in L3, therefore it prefers to construct simple parts in each branch, and let our max pooling layer combine them together. This allows better reconstruction quality than reconstructing the whole shape in just one branch. With an appropriate optimizer to minimize the reconstruction error, we can get wellsegmented parts in the output branches.
For part correspondence, we need to also consider the input shape feature codes. As shown in Figure 2, our network treats the feature code and point coordinates equally. This allows us to consider the whole decoder as a hyperdimensional implicit field, in a joint space made by both image dimensions (input point coordinates) and shape feature dimensions (shape feature code). In Figure 3, we visualize a 3D slice of this implicit field, with two image space dimensions and one feature dimension. Our network tries to find the best way to represent all shapes in the training set, and the easiest way is to arrange the training shapes so that the hyperdimensional shape is continuous in the feature dimension, as shown in the figure. This encourages the network to represent the final complex hyperdimensional shape as a composition of a few simple hyperdimensional shapes. In Figure 4, we show how our trained network can accomplish smooth interpolation and extrapolation of shapes. Our network is able to simultaneously accomplish segmentation and correspondence.
We compare the segmentation result of our 3-layer model with a 2-layer model, a 4-layer model, and a CNN model in Figure 4. (Detailed network parameters are in the supplementary material.) The 2-layer model has a hard time reconstructing the rings, since L2 is better at representing convex shapes. The 4-layer model can separate parts, but since most shapes can already be represented in L3, the extra fourth layer does not necessarily output separated parts. One can easily construct an L4 layer on top of our 3-layer model that outputs the whole shape in one branch and leaves the other branches blank. The CNN model is not sensitive to parts and outputs essentially everything or nothing in each branch, because there is no bias towards sparsity or segmentation. In our experiments, the 3-layer network is the best choice for independent shape extraction, making it a strong candidate for unsupervised and weakly supervised shape segmentation.
4.2 Evaluation of unsupervised learning
| Shape (#parts) | airplane (3) | bag (2) | cap (2) | chair (3) | chair* (4) | mug (2) | skateboard (2) | table (2) |
|---|---|---|---|---|---|---|---|---|
| Segmented parts | body, tail, wing+engine | body, handle | panel, peak | back+seat, leg, arm | back, seat, leg, arm | body, handle | deck, wheel+bar | top, leg+support |
| IOU | 61.1 | 82.5 | 87.3 | 65.5 | 83.7 | 93.4 | 63.5 | 78.7 |
| mod. IOU | 80.4 | 82.5 | 87.3 | 86.6 | 83.7 | 93.4 | 88.1 | 87.0 |
We first test unsupervised cosegmentation over 20 categories, where 16 of them are from the ShapeNet part dataset [62]. These categories and the number of shapes are: planes (2,690), bags (76), caps (55), cars (898), chairs (3,758), earphones (69), guitars (787), knives (392), lamps (1,547), laptops (451), motors (202), mugs (184), pistols (283), rockets (66), skateboards (152), and tables (5,271). The 4 extra categories, benches (1,816), rifles (2,373), couches (3,173), and vessels (1,939), are from ShapeNet [5].
We train individual models for different shape categories. Some results are shown in Figures 1 and 6, with more in the supplemental material. Reasonable parts are obtained, and each branch of BaeNet only outputs a specific part, giving us natural part correspondences. Our unsupervised segmentation is not guaranteed to produce the same part counts as those in the ground truth; it tends to produce coarser segmentation results, e.g., combining the seat and back of a chair. Since coarser segmentations are not necessarily wrong results, in Table 1, we report two sets of Intersection over Union (IOU) numbers which compare segmentation results by BaeNet and the ground truth, one allowing part combinations and the other not.
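For reference, a simplified version of the per-part IOU underlying these numbers might look as follows (our own illustration over point labels; the modified IOU additionally merges ground-truth parts into combinations before comparing, which this sketch does not do):

```python
def part_iou(pred_labels, gt_labels, parts):
    """Average per-part Intersection-over-Union between predicted and
    ground-truth point labels. A simplified, illustrative stand-in for
    the evaluation metric, not the paper's exact implementation."""
    ious = []
    for part in parts:
        inter = sum(1 for p, g in zip(pred_labels, gt_labels)
                    if p == part and g == part)
        union = sum(1 for p, g in zip(pred_labels, gt_labels)
                    if p == part or g == part)
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)
```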
Although unsupervised BaeNet does not separate chair back and seat when training on the chair category, it can do so when tables are added for training. Also, it successfully corresponds chair seats with tabletops, chair legs with table legs; see Figure 6. This leads to a weakly supervised way of segmenting target parts, as we discuss next.
4.3 Comparison with Tags2Parts
| | Arm | Back | Engine | Sail | Head |
|---|---|---|---|---|---|
| Tags2Parts [34] | 0.71 | 0.79 | 0.46 | 0.84 | 0.37 |
| BaeNet | 0.94 | 0.85 | 0.88 | 0.92 | 0.76 |
Now we compare our BaeNet with a state-of-the-art weakly supervised part labeling network, Tags2Parts [34]. Given a shape dataset and a binary label for each shape indicating whether a target part appears in it or not, Tags2Parts can separate out the target parts, with the binary labels as weak supervision. Our network can do the same task with even weaker supervision: we do not pass the labels to the network or incorporate them into the loss function. Instead, we use the labels to change the training data distribution.
Our intuition is that, if two parts always appear together and combine in the same way, like chair back and seat, treating them as one single part is more efficient for the network to reconstruct them. But when we change the data distribution, say letting only 20% of the shapes have the target part (such as chair backs), it will be more natural for the network to separate the two parts and reconstruct the target part in a single branch. Therefore, we add weak supervision by simply making the number of shapes that do not have the target part four times as many as the shapes that have the target part, by duplicating the shapes in the dataset.
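This rebalancing can be sketched as follows (function and variable names are our own; the paper simply duplicates the shapes lacking the target part until they outnumber the positives four to one):

```python
def rebalance(shapes, has_part, neg_to_pos=4):
    """Weak supervision via the data distribution: duplicate shapes
    *without* the target part so they outnumber positives roughly
    `neg_to_pos` to 1. Illustrative sketch of the idea described above."""
    pos = [s for s in shapes if has_part[s]]
    neg = [s for s in shapes if not has_part[s]]
    if not pos or not neg:
        return list(shapes)
    copies = max(1, round(neg_to_pos * len(pos) / len(neg)))
    return pos + neg * copies
```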
We used the dataset provided by [34], which contains six part categories: (chair) armrest, (chair) back, (car) roof, (airplane) engine, (ship) sail, and (bed) head. We ran our method on all categories except car roof, since it is a flat surface part that our network cannot separate. We used the unsupervised version of our network to perform the task, first training for a few epochs on the distribution-altered dataset for initialization, then training only on those shapes that have target parts to refine the results.
To compare results, we used the same metric as [34]: Area Under the Curve (AUC) of precision/recall curves. For each test point, we obtain its probability of being in each part by normalizing the branch outputs to unit sum. Quantitative and visual results are shown in Table 2 and Figure 7. Note that some parts, e.g., plane engines, that cannot be separated when training on the original dataset are segmented when training on the altered dataset. Our network particularly excels at segmenting planes, converging to eight effective branches representing body, left wing, right wing, engine and wheel on wings, jet engine and front wheel, vertical stabilizer, and two types of horizontal stabilizers.

4.4 One-shot training vs. supervised methods
| | 1-exem. vs. 10% train set | 2-exem. vs. 20% train set | 3-exem. vs. 30% train set |
|---|---|---|---|
| PointNet [40] | 72.1 | 73.0 | 74.6 |
| PointNet++ [41] | 69.6 | 75.4 | 76.6 |
| PointCNN [29] | 58.0 | 65.6 | 65.7 |
| SSCN [13] | 56.7 | 61.0 | 64.6 |
| Our BaeNet | 74.1 | 76.4 | 77.4 |
| | plane | bag | cap | chair | earph. | guitar | knife | lamp | laptop | motor. | mug | pistol | rocket | skate. | table | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PointNet [40] | 76.1 | 69.8 | 62.6 | 86.0 | 62.1 | 86.2 | 79.7 | 73.6 | 93.3 | 59.1 | 83.4 | 75.9 | 41.8 | 57.7 | 74.8 | 72.1 |
| PointNet++ [41] | 76.4 | 43.4 | 77.8 | 87.5 | 67.7 | 87.4 | 18.2 | 71.4 | 94.1 | 61.3 | 90.4 | 72.8 | 51.4 | 68.7 | 75.3 | 69.6 |
| PointCNN [29] | 73.6 | 44.6 | 36.5 | 86.1 | 35.1 | 87.0 | 80.6 | 75.9 | 94.6 | 16.9 | 48.6 | 52.9 | 22.7 | 43.8 | 71.2 | 58.0 |
| SSCN [13] | 74.2 | 50.9 | 46.2 | 84.5 | 58.6 | 86.2 | 76.0 | 59.6 | 53.6 | 25.3 | 46.0 | 42.7 | 25.1 | 44.3 | 77.0 | 56.7 |
| Ours 1-exem. | 71.2 | 73.9 | 81.6 | 85.6 | 62.6 | 86.4 | 82.1 | 64.0 | 94.5 | 52.8 | 94.7 | 73.8 | 39.8 | 70.4 | 78.1 | 74.1 |
| Ours 2-exem. | 72.9 | 81.8 | 84.7 | 85.0 | 64.1 | 88.2 | 81.3 | 62.9 | 94.4 | 59.4 | 94.1 | 76.4 | 48.2 | 72.7 | 80.5 | 76.4 |
| Ours 3-exem. | 74.6 | 75.2 | 82.0 | 85.1 | 76.1 | 88.3 | 82.4 | 66.4 | 94.3 | 63.7 | 94.6 | 77.3 | 47.2 | 72.8 | 80.7 | 77.4 |
| 2-layer 1-exem. | 61.2 | 71.8 | 58.9 | 84.6 | 68.0 | 52.2 | 26.3 | 65.4 | 94.3 | 28.0 | 92.3 | 71.6 | 31.3 | 45.5 | 80.6 | 62.1 |
| 4-layer 1-exem. | 72.7 | 80.7 | 78.7 | 81.8 | 59.7 | 86.5 | 37.8 | 63.5 | 94.1 | 57.5 | 94.5 | 71.2 | 41.6 | 60.5 | 78.9 | 70.6 |
Finally, we use one or a few segmented exemplars to constrain BaeNet to output designated parts, in order to evaluate our results against ground-truth labels and compare with other methods. In detail, we manually select one, two, or three segmented exemplar shapes from the training set, and train our model on the exemplars using the supervised loss, while training on the whole set using the unsupervised loss.
We did not find other semi-supervised segmentation methods that take only a few exemplars and segment shapes in a whole set. Therefore we compare our work with several state-of-the-art supervised methods, namely PointNet [40], PointNet++ [41], PointCNN [29] and SSCN [13]. Since it would be unfair to provide the supervised methods with only 1-3 exemplars to train on, as we did for BaeNet, we train their models using 10%, 20%, or 30% of the shapes from the datasets (average dataset size per category = 765 shapes) for comparison. We evaluate all methods on the ShapeNet part dataset [62] by average IOU (no part combinations are tolerated), and train individual models for different shape categories. We did not include the car category, for the same reason as in Section 4.3.
Ideally, we would select exemplars that contain all the ground-truth parts and are representative of the shape collection, but this can be difficult. For example, there are two kinds of lamps, ground lamps and ceiling lamps, that can both be segmented into three parts. However, the order and labels of the lamp parts are different, e.g., the base of a ground lamp and the base of a ceiling lamp have different labels. Our network cannot infer such semantic information from 1-3 exemplars, thus we selected only one type of lamp (ground lamps) as exemplars, and our network only has three branches (without one for the base of the ceiling lamp). During evaluation, we add a fake branch that always outputs zero, to represent the missing branch.
The qualitative results for the chair category are shown in Figure 8. The quantitative results for all categories are shown in Table 3, and some per-category results in Table 4. Our network trained with 1/2/3 exemplars outperforms the supervised methods trained with 10%/20%/30% of the training set. In some categories, our model trained on one exemplar already achieves high accuracy. In a few categories, adding more exemplars reduces accuracy, which may be due to the ambiguity introduced by the extra exemplars. We also show an ablation study varying the number of network layers, similar to the one in Section 4.1. We use {1024-k} and {1024-256-256-k} for the 2-layer and 4-layer models, respectively, and train them with one exemplar in the same setting as our default 3-layer model. The 2-layer model performs poorly on several categories since it lacks the representational ability to reconstruct complex shapes, e.g., shapes with round parts (cap, guitar, motorbike). The 4-layer model has very similar performance to the 3-layer model, but scores lower on some categories because it often merges a small part into a nearby larger part. These variations resemble the trends in Figure 4.
5 Conclusion, limitations, and future work
We have developed BaeNet, a branched autoencoder, for unsupervised, one-shot, and weakly supervised shape cosegmentation. Experiments show that our network can outperform state-of-the-art supervised methods, including PointNet++, PointCNN, etc., using much less training data (1-3 exemplars vs. 77-230 shapes for the supervised methods, averaged over 15 shape categories). On the other hand, compared to the supervised methods, BaeNet tends to produce coarser segmentations, which are not necessarily incorrect and can provide a good starting point for further refinement.
Many prior unsupervised cosegmentation methods [12, 18, 20, 50], which are modeldriven rather than datadriven, had only been tested on very small input sets (less than 50 shapes). In contrast, BaeNet can process much larger collections (up to 5,000+ shapes). In addition, unless otherwise noted, all the results shown in the paper were obtained using the default network settings, further validating the generality and robustness of our cosegmentation network.
BaeNet is able to produce consistent segmentations, over large shape collections, without explicitly enforcing a consistency loss – the consistency is a consequence of the network architecture. That being said, our current method does not provide any theoretical guarantee for segmentation consistency or universal part counts and such extensions could build upon the results of BaeNet.
For unsupervised segmentation, since we initialize the network parameters randomly and optimize a reconstruction loss while treating each branch equally, there is no way to predict which branch will eventually output which part. The network may also be sensitive to the initial parameters: different runs may produce different segmentations, e.g., combining seat and back vs. seat and legs for the chair category. Note, however, that both results may be acceptable as a coarse structural interpretation of chairs.
Another drawback is that our network groups similar, close-by parts in different shapes for correspondence. This is reasonable in most cases, but for some categories, e.g., lamps or tables, where similar, close-by parts may be assigned different labels, our network can be confused. How to incorporate shape semantics into BaeNet is worth investigating. Finally, BaeNet is much shallower and thinner than the network in [6], since we care more about segmentation quality than reconstruction quality. However, the limited depth and width of the network make it difficult to train on high-resolution models, which hinders us from obtaining fine-grained segmentations.
In future work, besides addressing the issues above, we plan to introduce hierarchies into the shape representation and network structure, since it is more natural to segment shapes in a coarsetofine manner. Also, BaeNet provides basic part separation and correspondence, which could be incorporated when developing generative models.
References
 [1] J. Ahn and S. Kwak. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In CVPR, 2018.
 [2] D. Anguelov, D. Koller, H. Pang, P. Srinivasan, and S. Thrun. Recovering articulated object models from 3D range data. In UAI, 2004.
 [3] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. TPAMI, 2017.
 [4] T. Berg, A. Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Learned-Miller, and D. Forsyth. Names and faces in the news. In CVPR, 2004.
 [5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. CoRR, abs/1512.03012, 2015.
 [6] Z. Chen and H. Zhang. Learning implicit fields for generative shape modeling. In CVPR, 2019.
 [7] R. G. Cinbis, J. J. Verbeek, and C. Schmid. Weakly supervised object localization with multi-fold multiple instance learning. TPAMI, 39(1), 2017.
 [8] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.

 [9] T. Durand, T. Mordan, N. Thome, and M. Cord. WILDCAT: Weakly supervised learning of deep ConvNets for image classification, pointwise localization and segmentation. In CVPR, 2017.
 [10] M. Everingham, J. Sivic, and A. Zisserman. Taking the bite out of automatic naming of characters in TV video. Image and Vision Computing, 2009.
 [11] W. Ge, S. Yang, and Y. Yu. Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In CVPR, 2018.
 [12] A. Golovinskiy and T. Funkhouser. Consistent segmentation of 3D models. Computers & Graphics, 33(3), 2009.
 [13] B. Graham, M. Engelcke, and L. van der Maaten. 3D semantic segmentation with submanifold sparse convolutional networks. In CVPR, 2018.
 [14] D. O. Hebb. The Organization of Behavior. 1949.
 [15] D. D. Hoffman and W. A. Richards. Parts of recognition. Cognition, 1984.
 [16] K.-J. Hsu, Y.-Y. Lin, and Y.-Y. Chuang. Co-attention CNNs for unsupervised object co-segmentation. In IJCAI, 2018.
 [17] R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick. Learning to segment every thing. In CVPR, 2018.
 [18] R. Hu, L. Fan, and L. Liu. Cosegmentation of 3D shapes via subspace clustering. Computer Graphics Forum, 31(5), 2012.

 [19] H. Huang, E. Kalogerakis, and B. Marlin. Analysis and synthesis of 3D shape families via deep-learned generative models of surfaces. Computer Graphics Forum, 34(5), 2015.
 [20] Q. Huang, V. Koltun, and L. Guibas. Joint shape segmentation with linear programming. Trans. Graph. (SIGGRAPH Asia), 30(6), 2011.
 [21] Q. Huang, W. Wang, and U. Neumann. Recurrent slice networks for 3D segmentation of point clouds. In CVPR, 2018.
 [22] Z. Huang, X. Wang, J. Wang, W. Liu, and J. Wang. Weakly-supervised semantic segmentation network with deep seeded region growing. In CVPR, 2018.
 [23] A. Joulin, F. Bach, and J. Ponce. Discriminative clustering for image cosegmentation. In CVPR, 2010.
 [24] A. Joulin, F. Bach, and J. Ponce. Multi-class cosegmentation. In CVPR, 2012.
 [25] E. Kalogerakis, M. Averkiou, S. Maji, and S. Chaudhuri. 3D shape segmentation with projective convolutional networks. In CVPR, volume 1, 2017.
 [26] A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, 2017.
 [27] V. G. Kim, W. Li, N. J. Mitra, S. Chaudhuri, S. DiVerdi, and T. Funkhouser. Learning partbased templates from large collections of 3D shapes. Trans. Graph. (SIGGRAPH), 32(4), 2013.
 [28] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In ECCV, 2016.
 [29] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen. PointCNN: Convolution on X-transformed points. In NeurIPS, 2018.
 [30] B.C. Lin, D.J. Chen, and L.W. Chang. Unsupervised image cosegmentation based on cooperative game. In Proc. ACCV, 2014.
 [31] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
 [32] T. Ma and L. J. Latecki. Graph transduction learning with connectivity constraints with application to multiple foreground cosegmentation. In CVPR, 2013.
 [33] M. Meng, J. Xia, J. Luo, and Y. He. Unsupervised cosegmentation for 3D shapes using iterative multi-label optimization. CAD, 45(2), 2013.
 [34] S. Muralikrishnan, V. G. Kim, and S. Chaudhuri. Tags2Parts: Discovering semantic regions from shape tags. In CVPR, 2018.
 [35] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? – weakly-supervised learning with convolutional neural networks. In CVPR, 2015.

 [36] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In ICCV, 2015.
 [37] D. Pathak, P. Krähenbühl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, 2015.
 [38] D. Pathak, E. Shelhamer, J. Long, and T. Darrell. Fully convolutional multi-class multiple instance learning. In ICLR Workshop, 2015.
 [39] P. O. Pinheiro and R. Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015.
 [40] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
 [41] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017.
 [42] M. Rajchl, M. C. H. Lee, O. Oktay, K. Kamnitsas, J. Passerat-Palmbach, W. Bai, M. Damodaram, M. A. Rutherford, J. V. Hajnal, B. Kainz, and D. Rueckert. DeepCut: Object segmentation from bounding box annotations using convolutional neural networks. Trans. Med. Imag., 36, 2017.
 [43] C. Rother, V. Kolmogorov, T. Minka, and A. Blake. Cosegmentation of image pairs by histogram matching – incorporating a global constraint into MRFs. In CVPR, 2006.
 [44] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu. Unsupervised joint object discovery and segmentation in internet images. In CVPR, 2013.
 [45] J. C. Rubio, J. Serrat, A. Lopez, and N. Paragios. Unsupervised cosegmentation through region matching. In CVPR, 2012.
 [46] F. Saleh, M. S. A. Akbarian, M. Salzmann, L. Petersson, S. Gould, and J. M. Alvarez. Built-in foreground/background prior for weakly-supervised semantic segmentation. In ECCV, 2016.
 [47] A. Shamir. A survey on mesh segmentation techniques. Computer Graphics Forum, 27(6), 2008.
 [48] P. Shilane and T. Funkhouser. Distinctive regions of 3D surfaces. Trans. Graph., 26(2), 2007.
 [49] Z. Shu, C. Qi, S. Xin, C. Hu, L. Wang, Y. Zhang, and L. Liu. Unsupervised 3D shape segmentation and cosegmentation via deep learning. CAGD, 43, 2016.

 [50] O. Sidi, O. van Kaick, Y. Kleiman, H. Zhang, and D. Cohen-Or. Unsupervised co-segmentation of a set of shapes via descriptor-space spectral clustering. Trans. Graph. (SIGGRAPH Asia), 30(6), 2011.
 [51] H. O. Song, R. B. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects with minimal supervision. In ICML, 2014.
 [52] S. Tulsiani, H. Su, L. J. Guibas, A. A. Efros, and J. Malik. Learning shape abstractions by assembling volumetric primitives. CVPR, 2017.
 [53] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. In CVPR, 2018.
 [54] C. Wang, W. Ren, K. Huang, and T. Tan. Weakly supervised object localization with latent category learning. In ECCV, 2014.
 [55] C. Wang, B. Yang, and Y. Liao. Unsupervised image segmentation using convolutional autoencoder with total variation regularization as preprocessing. In ICASSP, 2017.
 [56] X. Wang, S. You, X. Li, and H. Ma. Weakly-supervised semantic segmentation by iteratively mining common object features. In CVPR, 2018.
 [57] Y. Wang, S. Asafi, O. van Kaick, H. Zhang, D. Cohen-Or, and B. Chen. Active co-analysis of a set of shapes. Trans. Graph. (SIGGRAPH Asia), 31(6), 2012.
 [58] Z. Wang and R. Liu. Semi-supervised learning for large scale image cosegmentation. In ICCV, 2013.
 [59] X. Xia and B. Kulis. WNet: A deep model for fully unsupervised image segmentation. CoRR, abs/1711.08506, 2017.
 [60] K. Xu, H. Li, H. Zhang, D. Cohen-Or, Y. Xiong, and Z.-Q. Cheng. Style-content separation by anisotropic part scales. Trans. Graph. (SIGGRAPH Asia), 29(6), 2010.
 [61] L. Yi, H. Huang, D. Liu, E. Kalogerakis, H. Su, and L. Guibas. Deep part induction from articulated object pairs. Trans. Graph. (SIGGRAPH Asia), 37(6), 2018.
 [62] L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, L. Guibas, et al. A scalable active framework for region annotation in 3D shape collections. Trans. Graph. (SIGGRAPH Asia), 35(6), 2016.
 [63] L. Yi, H. Su, X. Guo, and L. J. Guibas. SyncSpecCNN: Synchronized spectral CNN for 3D shape segmentation. In CVPR, 2017.
 [64] J. Yu, D. Huang, and Z. Wei. Unsupervised image segmentation via stacked denoising autoencoder and hierarchical patch indexing. Signal Processing, 143, 2018.
 [65] X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. Huang. Adversarial complementary learning for weakly supervised object localization. In CVPR, 2018.
 [66] X. Zhao, S. Liang, and Y. Wei. Pseudo mask augmented object detection. In CVPR, 2018.