1 Introduction
Segmentation and labeling of 3D shapes is an important problem in geometry processing. These structural annotations are critical for many applications, such as animation, geometric modeling, manufacturing, and search [Mitra et al. 2013]. Recent methods have shown that, with supervised training on labeled shape databases, state-of-the-art performance can be achieved on mesh segmentation and part labeling [Kalogerakis et al. 2010, Yi et al. 2016]. However, such methods rely on carefully-annotated databases of shape segmentations, and producing these is an extremely labor-intensive process. Moreover, these methods have used coarse segmentations into just a few parts each, and do not capture the fine-grained, hierarchical structure of many real-world objects. Capturing fine-scale part structure is very difficult with non-expert manual annotation; it is difficult even to determine the set of parts and labels to separate. Another option is to use unsupervised methods that work without annotations by analyzing geometric patterns [van Kaick et al. 2013]. Unfortunately, these methods do not have access to the full semantics of shapes, and as a result often do not identify parts that are meaningful to humans, nor can they apply language labels to models or their parts. Additionally, typical co-analysis techniques do not easily scale to large datasets.
We observe that, when creating 3D shapes, artists often provide a considerable amount of extra structure with the model. In particular, they separate parts into hierarchies represented as scene graphs, and annotate individual parts with textual names. In surveying online geometry repositories, we find that most shapes are provided with these kinds of user annotations. Furthermore, there are often thousands of models per category available to train from. Hence, we ask: can we exploit this abundant and freely-available metadata to analyze and annotate new geometry?
Using these user-provided annotations comes with many challenges. For instance, the teaser figure (a) shows four typical scene graphs in the car category, created by four different authors. Each one has a different part hierarchy and set of parts; e.g., only two of the scene graphs have the steering wheel of the car as a separate node. The hierarchies have different depths; some are nearly-flat hierarchies and some are more complex. Only a few parts are given names in each model. Despite this variability, inspecting these models reveals common trends, such as certain parts that are frequently segmented, parts that are frequently given consistent names, and pairs of parts that frequently occur in parent-child relationships with each other. For example, the tire is often a separate part, it is usually the child of the wheel, and it usually has a name like tire or RightTire. Our goal is to exploit these trends, while being robust to the many variations in names and hierarchies that different model creators use.
This paper proposes to learn shape analysis from these messy, user-created datasets, thus leveraging the freely-available annotations provided by modelers. Our main goal is to automatically discover common trends in part segmentation, labeling, and hierarchy. Once learned, our method can be applied to new shapes that consist of geometry alone: the new shape is automatically segmented into parts, which are labeled and placed in a hierarchy. Our method can also be used to clean up existing databases. Our method is designed to work with large training sets, learning from thousands of models in a category. Because the annotations are uncurated, sparse (within each shape), and irregular, this problem is an instance of weakly-supervised learning.
Our approach handles each shape category (e.g., cars, airplanes, etc.) in a dataset separately. For a given shape category, we first identify the commonly-occurring part names within that class, and manually condense this set, combining synonyms and removing uninformative names. We then perform an optimization that simultaneously (a) learns a metric for classifying parts, (b) assigns names to unnamed parts where possible, (c) clusters other unnamed parts, (d) learns a canonical hierarchy for parts in the class, and (e) provides a consistent labeling to all parts in the database. Given this annotation of the training data, we train a hierarchical segmentation model based on a Markov Random Field (MRF). Then, given a new, unsegmented mesh, we can apply this learned model to segment the mesh, transfer the tags, and infer the part hierarchy. Our algorithms are designed to scale to training on large datasets through minibatch processing.
We use our method to analyze shapes from ShapeNet [Chang et al. 2015], a large-scale dataset of 3D models and part graphs obtained from online repositories. We demonstrate that our method can mine complex information, detecting hierarchies in man-made objects and their constituent parts, and obtaining finer-scale details than existing alternatives. While our problem is different from what has been explored in previous research, we perform two types of quantitative evaluations. First, we evaluate different variants of our method by holding some tags out, and show that all terms in our objective function are important to obtain the final result. Second, we show that supervised learning techniques require hundreds of manually labeled models before they reach the quality of segmentation that we get without any explicit supervision. We publicly share our code and the processed datasets in order to encourage further research.¹

¹http://cs.stanford.edu/~ericyi/project_page/hier_seg/index.html
2 Related Work
Recent shape analysis techniques focus on extracting structure from large collections of 3D models [Xu et al. 2016]. In this section we discuss recent work on detecting labeled parts and hierarchies in shape collections.
Shape Segmentation and Labeling.
Given a sufficient number of training examples, it is possible to learn to segment and label novel geometries [Kalogerakis et al. 2010, Yumer et al. 2014, Guo et al. 2015]. While supervised techniques achieve impressive accuracy, they require dense training data for each new shape category, which significantly limits their applicability. To decrease the cost of data collection, researchers have developed methods that rely on crowdsourcing [Chen et al. 2009], active analysis [Wang et al. 2012], or both [Yi et al. 2016]. However, this only decreases the cost of data collection, but does not eliminate it. Moreover, these methods have not demonstrated the ability to identify fine-grained model structure, or hierarchies. One can rely solely on consistency in part geometry to extract meaningful segments without supervision [Golovinskiy and Funkhouser 2009, Sidi et al. 2011, Huang et al. 2011, Hu et al. 2012, Kim et al. 2013, Huang et al. 2014]. However, since these methods do not take any human input into account, they typically only detect coarse parts, and do not discover semantically salient regions where geometric cues fail to encapsulate the necessary discriminative information. In contrast, we use the part graphs that accompany 3D models to weakly supervise the shape segmentation and labeling. This is similar in spirit to existing unsupervised approaches, but it mines semantic guidance from ambient data that accompanies most available 3D models.
Our method is an instance of weakly-supervised learning from data on the web. A number of related problems have been explored in computer vision, including learning classifiers and captions from user-provided images on the web, e.g., [Izadinia et al. 2015, Li et al. 2016, Ordonez et al. 2011], or from image searches, e.g., [Chen and Gupta 2015].
Shape Hierarchies.
Previous work attempted to infer scene graphs based on symmetry [Wang et al. 2011] or geometric matching [van Kaick et al. 2013]. However, as with unsupervised segmentation techniques, these methods only succeed in the presence of strong geometric cues. To address this limitation, Liu et al. [2014] proposed a method that learns a probabilistic grammar from examples, and then uses it to create consistent scene graphs for unlabeled input. However, their method requires accurately labeled example scene graphs. Fisher et al. [2011] use scene graphs from online repositories, focusing on arrangements of objects in scenes, whereas we focus on fine-scale analysis of individual shapes.
In contrast, we leverage the scene graphs that exist for most shapes created by humans. Even though these scene graphs are noisy and contain few meaningful node names (teaser figure (a)), we show that it is possible to learn a consistent hierarchy by combining cues from corresponding sparse labels and similar geometric entities in a joint framework. Such label correspondences not only help our clusters be semantically meaningful, but also help us discover additional common nodes in the hierarchy.
3 Overview
Our goal is to learn an algorithm that, given a shape from a specific class (e.g., cars or airplanes), can segment the shape, label the parts, and place the parts into a hierarchy. Our approach is to train on geometry downloaded from online model repositories. Each shape is composed of 3D geometry segmented into distinct parts; each part has an optional textual name, and the parts are placed in a hierarchy. The hierarchy for a single model is called a scene graph. As discussed above, different training models may be segmented into different hierarchies; our goal is to learn, from trends in the data, which parts are often segmented, how they are typically labeled, and which parts are typically children of other parts.
We break the analysis into two subtasks:

Part-Based Analysis (Section 4). Given a set of meshes in a specific category and their original messy scene graphs, we identify the dictionary of distinct parts for a category, and place them into a canonical hierarchy. This dictionary includes both parts with user-provided names (e.g., wheel) and a clustering of unnamed parts. All parts on the training meshes are labeled according to the part dictionary.

Hierarchical Mesh Segmentation (Section 5). We train a method to segment a new mesh into a hierarchical segmentation, using the labels and hierarchy provided by the previous step. For parts with textual names, these labels are also transferred to the new parts.
We evaluate by testing on held-out data and through qualitative evaluation. In addition, we show how to adapt our model to a benchmark dataset.
Our method makes two additional assumptions. First, our feature vector representations assume consistently-oriented meshes, following the representation in ShapeNetCore [Chang et al. 2015]. Second, the canonical hierarchy requires that every type of part has only one possible parent label; e.g., our algorithm might infer that the parent of a headlight is always the body, if this is frequently the case in the training data. In our segmentation algorithm, we usually assume that each connected component in the mesh belongs to a single part. This can be viewed as a form of over-segmentation assumption (e.g., [van Kaick et al. 2013]), and we found it to be generally true for our input data; e.g., see the teaser figure (b) and Figure 1. We show results both with and without this assumption in Section 6 and in the Supplemental Material.
4 PartBased Analysis
The first step of our process takes the shapes in one category as input, and identifies a dictionary of parts for that category, a canonical hierarchy for the parts, and a labeling of the training meshes according to this part dictionary. Each input shape $s$ is represented by a scene graph: a rooted directed tree $T^s = (V^s, E^s)$, where nodes $v_i \in V^s$ are parts with geometric features $x_i$, and each edge $(v_i, v_j) \in E^s$ indicates that part $v_j$ is a child of part $v_i$. We manually preprocess the user-provided part names into a tag dictionary $\mathcal{D}$, which is a list of part names relevant for the input category (Table 1). One could imagine discovering these names automatically. We opted for manual processing, since the vocabulary of words that appear in ShapeNet part labels is fairly limited, and there are many irregularities in label usage, e.g., synonyms and misspellings. The parts with a label from the dictionary are then assigned corresponding tags $t_i \in \mathcal{D}$. Note that many parts are untagged, either because no names were provided with the model, or because the user-provided names did not map onto names in the dictionary. Note also that the index $i$ enumerates parts within a shape independently of tags; e.g., there is no necessary relation between part $v_i$ on one shape and part $v_i$ on another. Each graph has a root node, which has a special root tag and no parent. For non-leaf nodes, the geometry of a node is the union of the geometries of its children.
To produce a dictionary of parts, we could directly use the user-provided tags, and then cluster the untagged parts. However, this naive approach would have several intertwined problems. First, the user-provided tags may be incorrect in various ways: missing tags for known parts (e.g., a wheel not tagged at all), tags given only at a high level of the hierarchy (e.g., the rim and the tire are not segmented from the wheel, and they are all tagged as wheel), and tags that are simply wrong. The clustering itself depends on a distance metric, which must be learned from labels. We would like tags to be applied as broadly and accurately as possible, to provide as much clean training data as possible for labeling and clustering, and to correctly transfer tags when possible. Finally, we would also like to use parent-child relationships to constrain the part labeling (so that a wheel is not the child of a door), but plausible parent-child relationships are not known a priori.
We address these problems by jointly optimizing for all unknowns: the distance metric, a dictionary of $K$ parts, a labeling of parts according to this dictionary, and a probability distribution over parent-child relationships. The labeling of model parts is also done probabilistically, by the Expectation-Maximization (EM) algorithm [Neal and Hinton 1998], where the hidden variables are the part labels. The distance metric is encoded in an embedding function $f_\theta$, which maps a part represented by a shape descriptor $x$ (Appendix A) to a lower-dimensional feature space. The function $f_\theta$ is represented as a neural network (Figure 11). Each canonical part has a representative cluster center $\mu_k$ in the feature space, so that a new part can be classified by nearest-neighbor distance in the feature space. Note that the clusters do not have an explicit association with tags: our energy function only encourages parts with the same tag to fall in the same cluster. As a post-process, we match tag names to clusters where possible. We model parent-child relationships with a matrix $H$, where $H_{lk}$ is, for a part in cluster $k$, the probability that its parent has label $l$. After the learning stage, $H$ is converted to a deterministic canonical hierarchy over all of the parts.
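Once the embedding and cluster centers are learned, labeling a new part reduces to a nearest-centroid lookup in the feature space. A minimal sketch of that step, with toy 2D embeddings and center values of our own (the paper's embedding is higher-dimensional):

```python
def nearest_cluster(embedding, centers):
    """Return the index of the cluster center closest to an embedded part.

    `embedding` stands in for the output of the learned map f_theta for one
    part; `centers` is the list of cluster centers mu_k in the same space.
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(centers)), key=lambda k: sq_dist(embedding, centers[k]))

# A part embedded near the second center is assigned cluster 1.
centers = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
print(nearest_cluster([0.9, 1.2], centers))  # -> 1
```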
Our method is inspired in part by the semi-supervised clustering method of Basu et al. [2004]. In contrast to their linear embedding of initial features for metric learning, we incorporate a neural network embedding procedure to allow nonlinear embedding in the presence of constraints, and use an EM soft clustering. In addition, Basu et al. [2004] do not take hierarchical representations into consideration, whereas our data is inherently a hierarchical part tree.
4.1 Objective function
The EM objective function is:
(1)   $E(\theta, z, \mu, H) = E_C + \lambda\,(E_M + E_S) + E_H + \sum_{s,i,k} z^s_{ik} \log z^s_{ik}$

where $\theta$ are the parameters of the embedding $f_\theta$, $z$ are the label probabilities such that $z^s_{ik}$ represents the probability of the $i$-th part of shape $s$ being assigned to the $k$-th label cluster, and $\mu = \{\mu_k\}$ are the unknown cluster centers. We set the weight $\lambda$ to a fixed constant throughout all experiments.
The first two terms, $E_C$ and $E_M$, encourage the concentration of clusters in the embedding space; $E_S$ encourages the separation of visually dissimilar parts in embedding space; $E_H$ is introduced to estimate the parent-child relationship matrix $H$; the entropy term is a consequence of the derivation of the EM objective (see the EM derivation in the appendix) and is required for correct estimation of probabilities. We next describe the energy terms one by one.

Our first term favors part embeddings to be near their corresponding cluster centroids:
(2)   $E_C = \sum_s \sum_{i,k} z^s_{ik}\, \big\| f_\theta(x^s_i) - \mu_k \big\|^2$
where $f_\theta$ is the embedding function, represented as a neural network and parametrized by a vector $\theta$. The network is described in Appendix A.
Second, our objective function constrains the embedding by favoring small distances between parts that share the same input tag, and between parts that have very similar geometry:
(3)   $E_M = \sum_{(i,j) \in \mathcal{M}} \max\big(0,\ \| f_\theta(x_i) - f_\theta(x_j) \| - \varepsilon\big)^2$

We extract all tagged parts and sample pairs from them for the constraint set $\mathcal{M}$. We set $\varepsilon$ to a small constant to account for near-perfect repetitions of parts, and to ensure that these parts are assigned to the same cluster.
Third, our objective favors separation in the embedded space by a margin $m$ between parts on the same shape that are not expected to have the same label:

(4)   $E_S = \sum_{(i,j) \in \mathcal{C}} \max\big(0,\ m - \| f_\theta(x_i) - f_\theta(x_j) \|\big)^2$

We only use parts from the same shape in $\mathcal{C}$, since we believe it is generally reasonable to assume that parts on the same shape with distinct tags or distinct geometry have distinct labels.
Finally, we score the labels of parent-child pairs by how well they match the overall parent-child label statistics in the data, using the negative log-likelihood of a multinomial:

(5)   $E_H = -\sum_s \sum_{(v_i, v_j) \in E^s} \sum_{l,k} z^s_{il}\, z^s_{jk} \log H_{lk}$
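The hierarchy term of a single parent-child edge can be evaluated directly from the two soft assignment vectors and $H$. A small illustrative sketch with toy numbers of our own; `H[l][k]` plays the role of the probability that a cluster-$k$ child has a cluster-$l$ parent:

```python
import math

def hierarchy_energy(z_parent, z_child, H):
    """E_H contribution of one parent-child edge: the expected negative
    log-likelihood -sum_{l,k} z_parent[l] * z_child[k] * log H[l][k]."""
    return -sum(zp * zc * math.log(H[l][k])
                for l, zp in enumerate(z_parent)
                for k, zc in enumerate(z_child))

# Two clusters; H says cluster-1 children almost always have cluster-0 parents.
H = [[0.9, 0.99], [0.1, 0.01]]
consistent = hierarchy_energy([1.0, 0.0], [0.0, 1.0], H)    # parent 0, child 1
inconsistent = hierarchy_energy([0.0, 1.0], [0.0, 1.0], H)  # parent 1, child 1
print(consistent < inconsistent)  # -> True
```

Edges whose labels match the statistics in $H$ contribute a much lower energy, which is what pushes the labeling toward a consistent hierarchy.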
4.2 Generalized EM algorithm
We optimize the objective function (Equation 1) by alternating between E- and M-steps. We solve for the soft labeling $z$ in the E-step, and for the other parameters, $\psi = (\theta, \mu, H)$, in the M-step, where $\theta$ are the parameters of the embedding $f_\theta$.
Estep.
Holding the model parameters $\psi$ fixed, we optimize for the label probabilities $z$:

(6)   $z = \arg\min_z E(\theta, z, \mu, H) \quad \text{s.t.} \quad \sum_k z^s_{ik} = 1 \ \ \forall\, s, i$

We optimize this via coordinate descent, iterating several times over all coordinates. The update is given in Appendix C.
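If the hierarchy coupling is ignored, the per-part minimizer of the distance-plus-entropy objective has a closed softmax form, which is the intuition behind the coordinate updates. A sketch under that simplifying assumption (names are ours):

```python
import math

def estep_update(dists):
    """Soft assignment for one part: minimizing sum_k z_k * d_k + z_k * log z_k
    subject to sum_k z_k = 1 gives the softmax z_k ∝ exp(-d_k).

    `dists` are the squared embedding distances ||f(x) - mu_k||^2; the
    hierarchy coupling term of the full objective is omitted in this sketch.
    """
    w = [math.exp(-d) for d in dists]
    s = sum(w)
    return [wi / s for wi in w]

z = estep_update([0.1, 2.0, 5.0])
print(max(range(3), key=lambda k: z[k]))  # -> 0 (closest cluster gets most mass)
```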
Mstep.
Next, we hold the soft clustering $z$ fixed and optimize the model parameters by solving the following subproblem:

(7)   $\psi = \arg\min_{\theta, \mu, H} E(\theta, z, \mu, H)$

We use stochastic gradient descent updates for $\theta$ and $\mu$, as is standard for neural networks, while keeping $H$ fixed. The parent-child probabilities are then computed as:

(8)   $H = N\Big( \sum_s \sum_{(v_i, v_j) \in E^s} z_i z_j^\top + \beta \Big)$

where $N$ is a column-wise normalization function that guarantees $\sum_l H_{lk} = 1$. $z_i$ and $z_j$ are the cluster probability vectors that correspond to the parent part $v_i$ and child part $v_j$ of the same shape, respectively. $\beta$ is a small positive constant in our experiments, to prevent entries of $H$ from stalling at zero. Since each column of $H$ is a separate multinomial distribution, the update in Eq. 8 is the standard multinomial estimator.
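The multinomial estimator amounts to accumulating outer products of parent and child assignment vectors, smoothing, and normalizing each column. A small sketch (the smoothing value is illustrative, not the paper's):

```python
def estimate_H(edges, beta=1e-3):
    """Estimate the parent-child matrix from soft assignments:
    H = N(sum over edges of z_parent z_child^T + beta), where N normalizes
    each column (child cluster) to sum to 1.

    `edges` is a list of (z_parent, z_child) probability-vector pairs.
    """
    K = len(edges[0][0])
    S = [[beta] * K for _ in range(K)]
    for zp, zc in edges:
        for l in range(K):
            for k in range(K):
                S[l][k] += zp[l] * zc[k]
    for k in range(K):  # column-wise normalization
        col = sum(S[l][k] for l in range(K))
        for l in range(K):
            S[l][k] /= col
    return S

# Five identical edges: cluster-1 children always have cluster-0 parents.
H = estimate_H([([1.0, 0.0], [0.0, 1.0])] * 5)
print(round(H[0][1], 2))  # -> 1.0
```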
Minibatch training.
The dataset for any category is far too large to fit in memory, so in practice we break the learning process into minibatches. Each minibatch includes 50 geometric models. For the constraint sets, 20,000 random pairs of parts are sampled across models in the minibatch. 30 epochs (passes over the whole dataset) are used.
For each minibatch, the E-step is computed as above. In the minibatch M-step, the embedding parameters $\theta$ and cluster centers $\mu$ are updated by stochastic gradient descent, using Adam updates [Kingma and Ba 2015]. For the hierarchy $H$, we use stochastic EM updates [Cappé and Moulines 2009], which are more stable and efficient than gradient updates. The sufficient statistics $S$ are computed for the minibatch:
(9)   $S = \sum_{(v_i, v_j) \in E_{\text{batch}}} z_i z_j^\top$
Running averages for the sufficient statistics are updated after each minibatch:
(10)   $\bar{S} \leftarrow (1 - \eta)\, \bar{S} + \eta\, S$

where the step size $\eta$ is held fixed in our experiments. Then, the estimate for $H$ is computed from the current sufficient statistics by:

(11)   $H = N(\bar{S} + \beta)$
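The running-average update of the sufficient statistics can be sketched as follows (the step size here is an assumed placeholder, not the paper's value):

```python
def update_running_stats(S_bar, S_batch, eta=0.1):
    """Stochastic-EM update of the sufficient statistics: a running average
    S_bar <- (1 - eta) * S_bar + eta * S_batch after each minibatch."""
    return [[(1 - eta) * sb + eta * s
             for sb, s in zip(row_bar, row_batch)]
            for row_bar, row_batch in zip(S_bar, S_batch)]

# Repeated identical batches converge toward the batch statistics.
S_bar = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(100):
    S_bar = update_running_stats(S_bar, [[1.0, 2.0], [3.0, 4.0]])
print(round(S_bar[0][1], 3))  # -> 2.0
```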
Initialization.
Our objective, like many EM algorithms, requires good initialization. We first initialize the neural network embedding with normalized initialization [Glorot and Bengio 2010]. For each named tag, we specify an initial cluster center as the average of the embeddings of all the parts with that tag. The remaining cluster centroids are randomly sampled from a normal distribution in the embedding space. The cluster label probabilities $z$ are initialized by a nearest-neighbor hard-clustering, and then $H$ is initialized by Eq. 8.

4.3 Outputs
Once the optimization is complete, we compute a canonical hierarchy $\mathcal{H}$ from $H$ by solving a directed minimum spanning tree problem, with the root constrained to be the entire object. Then, we assign tags to parts in the hierarchy by solving a linear assignment problem that maximizes the number of input tags in each cluster that agree with the tag assigned to their cluster. As a result, some parts in the canonical hierarchy receive textual names from assigned tags. Unmatched clusters are denoted with generic names (cluster_N). We then label each input part with its most likely node in $\mathcal{H}$ by selecting $\arg\max_k z_{ik}$. This gives a part labeling of each node in each input scene graph. An example of the canonical hierarchy with part names, and a labeled shape, is shown in Figure 2.
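The tag-to-cluster matching is a linear assignment problem; for the handful of tags in each category it can even be solved by brute force, as in this sketch (names and the toy counts are ours; a Hungarian-method solver such as SciPy's `linear_sum_assignment` would be the scalable choice):

```python
from itertools import permutations

def assign_tags(agreement):
    """Match tags to clusters by maximizing total agreement:
    agreement[t][c] counts how many parts carrying tag t landed in cluster c.
    Brute-force enumeration over assignments, fine for a few tags."""
    n_tags, n_clusters = len(agreement), len(agreement[0])
    best, best_perm = -1, None
    for perm in permutations(range(n_clusters), n_tags):
        score = sum(agreement[t][perm[t]] for t in range(n_tags))
        if score > best:
            best, best_perm = score, perm
    return list(best_perm)  # cluster index assigned to each tag

# Tag 0 mostly lands in cluster 1, tag 1 in cluster 0; cluster 2 stays unnamed.
print(assign_tags([[2, 9, 1], [8, 3, 0]]))  # -> [1, 0]
```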
This canonical hierarchy, part dictionary, and part labels for the input scene graphs are then used to train the segmentation algorithm as described in the next section.
5 Hierarchical Mesh Segmentation
Given the part dictionary, canonical hierarchy, and perpart labels from the previous section, we next learn to hierarchically segment and label new shapes. We formulate the problem as labeling each mesh face with one of the leaf labels from the canonical hierarchy. Because each part label has only one possible parent, all of a leaf node’s ancestors are unambiguous. In other words, once the leaf nodes are specified, it is straightforward to completely convert the shape into a scene graph, with all the nodes in the graph labeled. In order to demonstrate our approach in full generality, we assume the input shape includes only geometry, and no scene graph or part annotations. However, it should be possible to augment our procedure when such information is available.
5.1 Unary classifier
We begin by describing a technique for training a classifier for individual faces. This classifier can also be used to classify connected components. In the next section, we build an MRF labeler from it. Our approach is based on the method of Kalogerakis et al. [2010], but generalized to handle missing leaf labels and connected components, and to use neural network classifiers.
The face classifier is formulated as a neural network that takes geometric features $x$ of a face as input, and assigns scores to the leaf node labels for the face. The feature vector $x$ for a face consists of several standard geometric features. The neural network specifies a score function $s_l(x) = \mathbf{w}_l \cdot g(x)$, where $\mathbf{w}_l$ is a weight vector for label $l$, and $g$ is a sequence of fully-connected layers and nonlinear activation units applied to $x$. The score function is normalized by a softmax to produce an output probability:

(12)   $P(l \mid x) = \dfrac{\exp(s_l(x))}{\sum_{l' \in \mathcal{L}} \exp(s_{l'}(x))}$

where $\mathcal{L}$ is the set of possible leaf node labels. See Appendix B for details of the feature vector and neural network.
To train this classifier, we can apply the per-part labels from the previous section to the individual faces. However, there is one problem with doing so: many training meshes are not segmented to the finest possible detail. For example, a car wheel might not be segmented into tire and rim, or the windows may not be segmented from the body. In this case, the leaf node labels are not given for each face; only ancestor nodes are known: we do not know which wheel faces are tire faces. To handle this, we introduce a probability table $T$, where $T_{l,l'}$ is the probability of a face taking leaf label $l$ if the deepest label given for this training face is $l'$. For example, $T_{\text{tire},\text{wheel}}$ is the probability that the correct leaf label for a face labeled as a wheel is tire. To estimate $T$, we first compute unnormalized counts $\tilde{T}_{l,l'}$, the number of faces assigned both label $l$ and label $l'$, except that $\tilde{T}_{l,l'} = 0$ if $l'$ is not an ancestor of $l$ in the canonical hierarchy. Then $T$ is determined by normalizing each column to sum to 1: $T_{l,l'} = \tilde{T}_{l,l'} / \sum_m \tilde{T}_{m,l'}$.
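Building the table is a count-and-normalize operation with an ancestor mask; a sketch, assuming the ancestor test also admits the label itself (toy labels ours: 0 = wheel, 1 = tire, 2 = rim):

```python
def build_T(cooccurrence, ancestors):
    """Leaf-given-deepest-label table: zero out pairs where the given label
    is not an ancestor of (or equal to) the leaf label, then normalize each
    column to sum to 1.

    cooccurrence[l][lp] counts faces carrying both leaf label l and given
    label lp; ancestors[l] is the set of ancestors of label l.
    """
    L = len(cooccurrence)
    T = [[cooccurrence[l][lp] if (lp == l or lp in ancestors[l]) else 0.0
          for lp in range(L)] for l in range(L)]
    for lp in range(L):
        col = sum(T[l][lp] for l in range(L)) or 1.0
        for l in range(L):
            T[l][lp] /= col
    return T

# 40 faces labeled "wheel": 30 of them also appear as "tire", 10 as "rim".
co = [[0, 0, 0], [30, 30, 0], [10, 0, 10]]
ancestors = [set(), {0}, {0}]
T = build_T(co, ancestors)
print(round(T[1][0], 2))  # -> 0.75 (a wheel face is a tire face with p=0.75)
```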
We then train the classifier by minimizing the following loss function for the label weights $\mathbf{w}$ and the parameters $\gamma$ of $g$:

(13)   $L(\mathbf{w}, \gamma) = -\sum_f \log \sum_{l \in \mathcal{L}} T_{l, y_f}\, P(l \mid x_f)$

where the sum is over all faces $f$ in the training shapes and $y_f$ is the deepest label assigned to face $f$ as discussed above. This loss is the negative log-likelihood of the training data, marginalizing over the hidden true leaf label for each training face, generalizing [Izadinia et al. 2015]. We use stochastic gradient descent to minimize this objective.
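The marginalized loss can be sketched directly from the table $T$ and the classifier outputs (function names and the toy identity table are ours):

```python
import math

def marginalized_nll(face_probs, given_labels, T):
    """Training loss marginalizing over the hidden true leaf label:
    -sum_f log sum_l T[l][y_f] * P(l | x_f).

    face_probs[f][l] is the softmax output for face f; given_labels[f] is
    the deepest label annotated for that face.
    """
    loss = 0.0
    for probs, y in zip(face_probs, given_labels):
        loss -= math.log(sum(T[l][y] * probs[l] for l in range(len(probs))))
    return loss

# With an identity table (labels fully resolved), this is the usual NLL.
val = marginalized_nll([[0.8, 0.2]], [0], [[1.0, 0.0], [0.0, 1.0]])
print(round(val, 3))  # -> 0.223, i.e. -log(0.8)
```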
We have also observed that meshes in online repositories are composed of connected components, and these connected components almost always have the same label over the entire component. For most results presented in this paper, we use connected components as the basic labeling units instead of faces, in order to improve results and speed. We define the connected-component classifier by aggregating the trained face classifier over all the faces of the connected component as follows:

(14)   $P(l \mid C) = \dfrac{\sum_{f \in C} a_f\, P(l \mid x_f)}{\sum_{f \in C} a_f}$

where $a_f$ is the area of face $f$.
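One plausible reading of this aggregation is an area-weighted average of the per-face probabilities; the sketch below implements that reading (the weighting choice is our assumption):

```python
def component_probs(face_probs, face_areas):
    """Aggregate per-face label probabilities into one distribution for a
    connected component, weighting each face by its area."""
    total = sum(face_areas)
    L = len(face_probs[0])
    return [sum(a * p[l] for a, p in zip(face_areas, face_probs)) / total
            for l in range(L)]

# A large confident face dominates a small uncertain one.
cp = component_probs([[0.9, 0.1], [0.5, 0.5]], [3.0, 1.0])
print([round(v, 3) for v in cp])  # -> [0.8, 0.2]
```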
5.2 MRF labeler
Let $\mathcal{L}$ be the set of leaf nodes of the canonical hierarchy. In the case of classifying each connected component, we want to specify one leaf node $\ell_C \in \mathcal{L}$ for each connected component $C$. We define the MRF over connected-component labels as:

(15)   $E(\ell) = \sum_C E_{\mathrm{data}}(\ell_C) + \lambda \sum_{(C, C') \in \mathcal{E}} E_{\mathrm{pair}}(\ell_C, \ell_{C'})$
where the weight $\lambda$ is set by cross-validation separately for each shape category and held constant across all experiments. The unary term assesses the likelihood of a component having a given leaf label, based on geometric features of the component, and is given by the classifier:

(16)   $E_{\mathrm{data}}(\ell_C) = -\log P(\ell_C \mid C)$
The edge term prefers adjacent components to have the same label. It is defined as $E_{\mathrm{pair}}(\ell, \ell') = d_T(\ell, \ell')$, where $d_T(\ell, \ell')$ is the tree distance between labels $\ell$ and $\ell'$ in the canonical hierarchy. This encourages adjacent labels to be as close in the canonical hierarchy as possible. For example, $d_T$ is 0 when the two labels are the same, whereas $d_T$ is 2 if they are different but share a common parent. To generate the edge set $\mathcal{E}$ in Eq. 15, we connect each connected component to its nearest connected components, with the neighborhood size chosen as a function of the number of connected components in the mesh.
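The tree distance between two labels is the length of the path through their lowest common ancestor in the canonical hierarchy. A sketch over a child-to-parent map (the toy labels are ours):

```python
def tree_distance(a, b, parent):
    """Number of edges between labels a and b in a rooted tree, given a
    child -> parent map (the root maps to None)."""
    def path_to_root(n):
        path = [n]
        while parent[n] is not None:
            n = parent[n]
            path.append(n)
        return path
    pa, pb = path_to_root(a), path_to_root(b)
    depth_in_a = {n: i for i, n in enumerate(pa)}
    for j, n in enumerate(pb):
        if n in depth_in_a:            # lowest common ancestor found
            return depth_in_a[n] + j
    raise ValueError("labels are not in the same tree")

# root -> wheel -> {tire, rim}: siblings are 2 apart, equal labels 0 apart.
parent = {"root": None, "wheel": "root", "tire": "wheel", "rim": "wheel"}
print(tree_distance("tire", "rim", parent))   # -> 2
print(tree_distance("tire", "tire", parent))  # -> 0
```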
Once the classifiers are trained, the model can be applied to a new mesh as follows. First, the leaf labels are determined by optimizing Equation 15 using the α-β swap algorithm [Boykov et al. 2001]. Then, the scene graph is computed by bottom-up grouping: adjacent components with the same leaf label are first grouped together; then, adjacent groups with the same parent are grouped at the next level of the hierarchy, and so on.
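The first grouping pass, merging adjacent components that received the same leaf label, is a small union-find computation; a sketch with a toy adjacency of our own:

```python
def group_same_label(n, edges, labels):
    """Merge adjacent components (index pairs in `edges`) that received the
    same leaf label, using a small union-find; returns a group id per
    component."""
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i, j in edges:
        if labels[i] == labels[j]:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

# Components 0 and 1 are adjacent and both "tire" -> one group;
# component 2 ("rim") stays separate despite being adjacent to 1.
groups = group_same_label(3, [(0, 1), (1, 2)], ["tire", "tire", "rim"])
print(groups[0] == groups[1], groups[1] == groups[2])  # -> True False
```

Higher levels of the scene graph follow by repeating the same merge over groups that share a parent label in the canonical hierarchy.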
For the case where connected components are not available, the MRF algorithm is applied per face. The unary term is given by the face classifier $P(l \mid x)$. We still need to handle the case where the object is not topologically connected, so the pairwise term applies to all faces $f_i$ and $f_j$ whose centroids fall into each other's nearest neighborhoods, and is given by:

(17)   $E_{\mathrm{pair}}(\ell_i, \ell_j) = s_{ij}\, w(\phi_{ij})\, \exp\!\big(-d_{ij} / (\sigma \bar{d})\big)\, d_T(\ell_i, \ell_j)$

where $\phi_{ij}$ is the angle between the faces and $w(\cdot)$ penalizes label changes across flat regions more than across sharp creases, $d_{ij}$ is the distance between the face centroids, $\bar{d}$ is the average distance between a face's centroid and its nearest face's centroid, and $\sigma$ is held constant in all our experiments. $s_{ij}$ is a scale factor that promotes agreement between faces sharing an edge.
6 Results
In this section we evaluate our method on “in the wild” data from public online repositories and on a standard benchmark. We perform the evaluation by comparing with part-based analysis and segmentation techniques, using novel metrics.
Input Data.
We run our method on 9 shape categories from the ShapeNetCore dataset [Chang et al. 2015], a collection of 3D models obtained from various online repositories. We use this dataset for convenience, because the data has been preprocessed, cleaned, categorized, and put into common formats; at present, it is the only dataset known to us that satisfies our low-level preprocessing requirements. We excluded most categories (40 of them) because they only have a few hundred shapes or less, which is inadequate for our approach. We assume that a tag is sufficiently represented if it appears in at least 25 shapes, and we only analyze categories that have more than 2 such tags. Some categories have trivial geometry (e.g., mobile phones). Some categories do not provide enough parts with common labels (e.g., watercraft are heterogeneous to the point of being disjoint sets of objects). The ShapeNetCore dataset currently contains a small subset of the available online repositories, which limits the data that we have immediately at hand. However, ShapeNetCore is growing; applying our method to much larger datasets is limited only by the nuisance of preprocessing heterogeneous datasets.
Typical scene graphs in such online repositories are very diverse, including between one and thousands of nodes (Figure 3), and ranging from flat to deep hierarchies (Figure 4). For each category, we also prescribe a list of relevant tags and possible synonyms. We automatically create a list of the most-used tags for the entire category, and then manually pick relevant English nouns as the tag dictionary. Note that only a fraction of shapes have any parts with the chosen tags, and the frequency distribution over tag names is very uneven (Table 1, Init column).
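The tag-dictionary construction described here, counting raw node names, folding synonyms and spelling variants, and keeping sufficiently represented tags, can be sketched as follows (the synonym map and example names are ours; the threshold is a per-occurrence simplification of the per-shape 25-shape rule above):

```python
from collections import Counter

def condense_tags(raw_names, synonym_map, min_count=25):
    """Build a tag dictionary from raw scene-graph node names: fold synonyms
    onto canonical tags, then keep tags with enough support."""
    canonical = Counter()
    for name in raw_names:
        key = name.strip().lower()
        canonical[synonym_map.get(key, key)] += 1
    return {tag for tag, c in canonical.items() if c >= min_count}

syn = {"tyre": "tire", "righttire": "tire"}
names = ["Tire"] * 20 + ["tyre"] * 10 + ["spoiler"] * 3
print(condense_tags(names, syn))  # -> {'tire'}
```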
For a category with 2000 shapes, the part-based analysis takes approximately one hour, and the segmentation training takes approximately 10 hours, each on a single Titan X GPU. Once trained, analysis of a new shape typically takes about 25 seconds, of which about 15 seconds is extracting face features with non-optimized Matlab code.
Hierarchical Mesh Segmentation and Labeling.
Figure 9 demonstrates some representative results produced by our hierarchical segmentation based on connected components (Section 5). The resulting hierarchical segmentations vary in depth from flat (e.g., chairs) to deep (e.g., cars, airplanes), reflecting the complexity of the corresponding object. We also often extract consistent part clusters, even if they do not have textual tags. We found that analyzing shapes at the granularity of connected components is usually sufficient: the mean number of connected components per object in ShapeNet is 4169, and the largest connected component in a shape covers only a small fraction of the total surface area on average: connected components tend to be small. These components are often aligned to part boundaries; for example, if one were to annotate the ShapeNet segmentation benchmark [Yi et al. 2016] by assigning a majority label to each connected component, a large fraction of faces would be labeled correctly.
Segmentation without Connected Components.
In the case of applying per-face labeling, when connected components are not available, we observe results similar to those where the connected components are used. However, a few segments do not come out as cleanly segmented on more complex models (see Figure 10). Please refer to our supplementary material for qualitative results of this experiment. We tested our method on other datasets (Thingi10k [Zhou and Jacobson 2016], COSEG [Wang et al. 2012]), but were only able to test on a limited set of models, since only a few models in these datasets fall into our training categories.
Tag prediction.
Table 1 (Final column) shows the fraction of training shapes that received a particular tag after our part-based analysis (Section 4). Note that an object may be missing a tag for several reasons: it could be misclassified, the object may not have that part, or the part may not be segmented as a separate scene graph node. As is evident from this quantitative analysis, the amount of training data we can use in subsequent analysis has drastically increased. Please refer to the supplementary material for visual examples of labeling results.
To evaluate tag prediction accuracy, we perform the following experiment. We hold out 30% of the tagged parts during training, and evaluate labeling accuracy on these parts. As our method is based on nearest-neighbor (NN) classification, we compare against NN on features computed in the following ways: (1) clustering with LFD features, (2) clustering with the raw part descriptors $x$, (3) our method without the $E_C$ term, and (4) without the $E_H$ term. Results are reported in Table 3. As shown in the table, our method significantly improves tag classification performance over the baselines. This experiment also demonstrates the value of the clustering and hierarchy terms $E_C$ and $E_H$.
Part  Init  Final  Part  Init  Final  Part  Init  Final 
Category: Car (2287 shapes)  
Wheel  17.4  96.4  Mirror  6.9  68.2  Window  9.5  71.8 
Fender  1.1  40.4  Bumper  15.2  63.0  Roof  2.4  48.2 
Exhaust  7.2  57.5  Floor  2.6  49.9  Trunk  4.5  60.6 
Door  19.9  67.8  Spoiler  2.8  41.3  Rim  4.1  74.1 
Headlight  14.6  61.9  Hood  12.2  68.7  Tire  5.3  33.5 
Category: Airplane (2574 shapes)  
Wing  5.9  86.9  Engine  5.5  81.6  Body  2.3  86.8 
Tail  1.5  90.5  Missile  0.4  66.7  
Category: Chair (2401 shapes)  
Arm  1.5  62.4  Leg  3.6  71.0  Back  0.9  50.8 
Seat  2.4  83.1  Wheel  1.4  34.6  
Category: Table (2355 shapes)  
Top  1.9  81.5  Leg  6.4  85.8  
Category: Sofa (1243 shapes)  
BackPillow  4.7  79.3  Seat  2.7  63.0  Feet  3.1  41.5 
Category: Rifle (994 shapes)  
Barrel  1.2  62.9  Bipod  1.7  47.6  Scope  3.2  66.7 
Stock  3.2  56.0  
Category: Bus (713 shapes)  
Seat  4.8  47.0  Wheel  8.3  88.9  Mirror  1.7  42.4 
Category: Guitar (491 shapes)  
Neck  3.1  75.8  Body  0.8  67.6  
Category: Motorbike (281 shapes)  
Seat  23.5  49.2  Engine  21.7  84.0  Gastank  14.6  73.7 
Exhaust  1.8  71.5  Handle  8.9  75.8  Wheel  41.6  98.9 
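The held-out nearest-neighbor evaluation above can be sketched in plain Python; the toy feature vectors and tags below are illustrative, and the 30% hold-out fraction follows the text:

```python
import math
import random

def nearest_neighbor_label(query, train_feats, train_tags):
    """Predict a part tag as the tag of the nearest training part
    in the embedding space (Euclidean distance)."""
    dists = [math.dist(query, f) for f in train_feats]
    return train_tags[dists.index(min(dists))]

def holdout_accuracy(feats, tags, holdout=0.3, seed=0):
    """Hold out a fraction of tagged parts and score 1-NN prediction."""
    idx = list(range(len(feats)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(holdout * len(idx)))
    test, train = idx[:n_test], idx[n_test:]
    tf = [feats[i] for i in train]
    tt = [tags[i] for i in train]
    hits = sum(nearest_neighbor_label(feats[i], tf, tt) == tags[i] for i in test)
    return hits / n_test

# Toy embedding: two well-separated tag clusters.
feats = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
tags = ["wheel", "wheel", "wheel", "door", "door", "door"]
print(holdout_accuracy(feats, tags))  # 1.0 on this separable toy set
```

A good embedding is exactly one in which same-tag parts form such tight clusters, which is what makes the 1-NN rule effective.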
Cluster evaluation.
Figure 5 (bottom) demonstrates some parts grouped by our method in the part-based analysis (Section 4). We also note that some clusters combine unrelated parts; we believe these serve as null clusters for outliers.
As we do not have ground truth for the unlabeled clusters, we instead evaluate the ability of our learned embedding to cluster parts, using the user-provided labels as ground truth. We split the tags for a category into training tags and test tags. We run the part-based analysis on all shapes, but provide only the training subset of tags to the algorithm. This gives an embedding, and we evaluate how well it supports clustering by running k-means on the parts that correspond to the test tags, with k set to the number of test tags. The clustering result is then compared to the test tag labeling via normalized Mutual Information. This process is repeated in a 3-fold cross-validation. Table 2 shows quantitative results; our method performs significantly better than the baselines, which include k-means on Light Field Descriptors (LFD) and versions of our method omitting terms from the objective. The baseline scores are especially low on categories with few parts, like Bus and Table.
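The normalized Mutual Information score used to compare a clustering against the held-out tags can be computed from a contingency table with the standard library alone. This sketch normalizes by the arithmetic mean of the two entropies (other normalizations exist), and omits the k-means step itself:

```python
import math
from collections import Counter

def normalized_mutual_info(pred, truth):
    """NMI between a predicted clustering and ground-truth tags,
    normalized by the arithmetic mean of the two entropies."""
    n = len(pred)
    joint = Counter(zip(pred, truth))
    pc, tc = Counter(pred), Counter(truth)
    mi = sum(c / n * math.log((c / n) / ((pc[a] / n) * (tc[b] / n)))
             for (a, b), c in joint.items())
    entropy = lambda cnt: -sum(c / n * math.log(c / n) for c in cnt.values())
    denom = (entropy(pc) + entropy(tc)) / 2
    return mi / denom if denom > 0 else 1.0

print(normalized_mutual_info([0, 0, 1, 1], ["a", "a", "b", "b"]))  # ≈ 1.0
print(normalized_mutual_info([0, 1, 0, 1], ["a", "a", "b", "b"]))  # ≈ 0.0
```

A perfect one-to-one correspondence between clusters and tags scores 1, while a clustering independent of the tags scores 0, matching the "Chance" rows in Table 2.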
Comparison to Unsupervised Co-Hierarchy.
Van Kaick et al. [2013] propose an unsupervised approach for establishing consistent hierarchies within an object category. Their method was developed for small shape collections and requires hours of computation for 20 models, which makes it unsuitable for ShapeNet-scale data. Conversely, since we assume that some segments have textual tags, we cannot run our method on their data. Given these constraints, we show a qualitative comparison to their method: we picked the first car and first airplane in their dataset, and retrieved the most similar models in ShapeNet using light field descriptors. Figure 6 shows their hierarchies and ours side by side. Note that our method generates more detailed hierarchies and also provides textual tags for parts.
Comparison to Supervised Segmentation.
Since there are no large-scale hierarchical segmentation benchmarks, we test our method on the segmentation dataset provided by Yi et al. [2016]. We emphasize that the benchmark contains much coarser segmentations than those we can produce, and does not include hierarchies. We take the intersection of our 9 categories and the benchmark, which yields six categories for quantitative evaluation: car, airplane, motorbike, guitar, chair, and table.
Since other techniques do not leverage connected components, we evaluate per-face classification from unary terms only, comparing our per-face classification prediction (Eq. 14) to results from Yi et al. [2016] trained only on benchmark data.
Our training data is sampled from a different distribution than the benchmark's; repurposing a model trained on one dataset for another is a problem known as domain adaptation.
The first approach we test is to directly map the labels predicted by our classifier to benchmark labels. The second approach is to obtain 5 training examples from the benchmark, and to train a Support Vector Machine (SVM) classifier to predict benchmark labels from our learned features (Sec. 4). The resulting classifier is the softmax of the per-label SVM scores. As baseline features, we also test k-means clustering with LFD features over all input parts, with the number of clusters matched to that used by our method.

Results of the supervised segmentation comparison are shown in Figure 7. Without training on our features, the method of Yi et al. [2016] requires 50–100 benchmark training examples to match the results we obtain with only 5 benchmark examples. Although our method is trained on many ShapeNet meshes, these meshes did not require any manual labeling. This illustrates how our method, trained on freely-available data, can be cheaply adapted to a new task.
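The softmax-over-SVM-scores step can be sketched as follows; the part-label names and decision values are illustrative, and the SVM training itself is omitted:

```python
import math

def softmax_over_svm_scores(scores):
    """Convert per-label SVM decision values into a probability
    distribution via a numerically stable softmax."""
    m = max(scores.values())
    exps = {label: math.exp(s - m) for label, s in scores.items()}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}

# Hypothetical decision values for three benchmark labels.
probs = softmax_over_svm_scores({"wheel": 2.0, "door": 0.5, "hood": -1.0})
best = max(probs, key=probs.get)  # → "wheel"
```

Subtracting the maximum score before exponentiating leaves the result unchanged but avoids overflow for large decision values.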
Figure 8 shows qualitative results from the comparison with Yi et al. [2016], where their method is trained on 10 benchmark models and our approach performs domain adaptation using the same 10 models.
Table 2. Part clustering evaluation: normalized Mutual Information between predicted clusters and held-out test tags (higher is better).
Category  Mean  Car  Airplane  Chair  Table  Motorbike  Bus  Guitar  Rifle  Sofa 

Chance  0.034  0.019  0.010  0.040  0.005  0.020  0.026  0.018  0.176  0.032 
LFD  0.336  0.521  0.315  0.350  0.238  0.576  0.292  0.034  0.379  0.297 
Part features (App. A)  0.348  0.551  0.264  0.352  0.238  0.607  0.313  0.101  0.405  0.297 
No clustering term  0.406  0.626  0.377  0.346  0.124  0.564  0.260  0.498  0.445  0.408 
No hierarchy term  0.561  0.695  0.568  0.622  0.445  0.659  0.367  0.655  0.514  0.566 
Ours  0.573  0.712  0.575  0.619  0.448  0.678  0.371  0.655  0.526  0.571 
Table 3. Tag classification accuracy on held-out tagged parts.
Category  Mean  Car  Airplane  Chair  Table  Motorbike  Bus  Guitar  Rifle  Sofa 

Chance  0.139  0.044  0.136  0.172  0.148  0.149  0.100  0.252  0.092  0.162 
LFD  0.790  0.530  0.823  0.775  0.745  0.829  0.813  0.976  0.723  0.892 
Part features (App. A)  0.823  0.584  0.832  0.812  0.772  0.874  0.822  0.976  0.818  0.920 
No clustering term  0.840  0.694  0.821  0.749  0.910  0.860  0.911  0.982  0.772  0.864 
No hierarchy term  0.899  0.701  0.934  0.902  0.926  0.865  0.884  0.991  0.936  0.953 
Ours  0.910  0.709  0.970  0.905  0.921  0.878  0.884  0.994  0.951  0.979 
7 Discussion and Conclusion
We have proposed a novel method for mining consistent hierarchical shape models from massive but sparsely-annotated scene graphs "in the wild." As we analyze the input data, we jointly embed parts into a low-dimensional feature space, cluster corresponding parts, and build a probabilistic model of the hierarchical relationships among them. We demonstrated that our model facilitates hierarchical mesh segmentation, extracting complex hierarchies and identifying small segments in 3D models from various shape categories. Our method can also provide a valuable boost for supervised segmentation algorithms. The goal of our current framework is to extract as much structure as possible from the raw, noisy, sparsely-tagged scene graphs that exist in online repositories. We believe that using such freely-available information will provide enormous opportunities for shape analysis.
Developing convolutional neural networks for surfaces is currently a very active research area, e.g., [Guo et al. 2015]. Our segmentation training loss functions are largely agnostic to the model representation, and it ought to be straightforward to train a ConvNet on our training loss, for any ConvNet that handles disconnected components.

Though effective, as evidenced by the experimental evaluations, several issues are not yet completely addressed. Our model currently relies on a heuristic selection of the number of clusters, which could instead be chosen automatically. We could also relax the assumption that each part label has only one possible parent label, to allow more general shape grammars [Talton et al. 2012]. Our method has produced a large set of 3D training models with roughly consistent segmentation, but these have not been human-verified. We also believe that our approach could be combined with crowdsourcing techniques [Yi et al. 2016] to efficiently yield very large, detailed, segmented, and verified shape databases.
It would also be interesting to explore how well information learned from one object category transfers to others. For example, a "wheel" can be found in both "cars" and "motorbikes", sharing similar geometry and substructure. This observation creates the opportunity to transfer not only part embeddings but also part relationships. As online model repositories grow, such transfer learning will become increasingly important for efficiently expanding our dataset.
References
 [Basu et al. 2004] Basu, S., Bilenko, M., and Mooney, R. J. 2004. A probabilistic framework for semisupervised clustering. In Proc. KDD.
 [Belongie et al. 2002] Belongie, S., Malik, J., and Puzicha, J. 2002. Shape matching and object recognition using shape contexts. IEEE TPAMI 24, 4, 509–522.
 [Boykov et al. 2001] Boykov, Y., Veksler, O., and Zabih, R. 2001. Fast approximate energy minimization via graph cuts. IEEE Trans. PAMI.
 [Cappé and Moulines 2009] Cappé, O., and Moulines, E. 2009. Online expectationmaximization algorithm for latent data models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71, 3, 593–613.
 [Chang et al. 2015] Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., and Yu, F., 2015. ShapeNet: An InformationRich 3D Model Repository. arXiv:1512.03012.
 [Chen and Gupta 2015] Chen, X., and Gupta, A. 2015. Webly supervised learning of convolutional networks. In Proc. ICCV.
 [Chen et al. 2003] Chen, D.-Y., Tian, X.-P., Shen, Y.-T., and Ouhyoung, M. 2003. On visual similarity based 3D model retrieval. In Computer Graphics Forum (Eurographics).
 [Chen et al. 2009] Chen, X., Golovinskiy, A., and Funkhouser, T. 2009. A benchmark for 3D mesh segmentation. In ACM SIGGRAPH, 73:1–73:12.
 [Fisher et al. 2011] Fisher, M., Savva, M., and Hanrahan, P. 2011. Characterizing structural relationships in scenes using graph kernels. In ACM TOG, vol. 30, 34.
 [Glorot and Bengio 2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
 [Golovinskiy and Funkhouser 2009] Golovinskiy, A., and Funkhouser, T. 2009. Consistent segmentation of 3D models. Proc. SMI 33, 3, 262–269.
 [Guo et al. 2015] Guo, K., Zou, D., and Chen, X. 2015. 3D mesh labeling via deep convolutional neural networks. ACM TOG 35, 1.
 [Hu et al. 2012] Hu, R., Fan, L., and Liu, L. 2012. Co-segmentation of 3D shapes via subspace clustering. SGP 31, 5, 1703–1713.
 [Huang et al. 2011] Huang, Q., Koltun, V., and Guibas, L. 2011. Joint shape segmentation with linear programming. In ACM SIGGRAPH Asia, 125:1–125:12.
 [Huang et al. 2014] Huang, Q., Wang, F., and Guibas, L. 2014. Functional map networks for analyzing and exploring large shape collections. SIGGRAPH 33, 4.
 [Izadinia et al. 2015] Izadinia, H., Russell, B. C., Farhadi, A., Hoffman, M. D., and Hertzmann, A. 2015. Deep classifiers from image tags in the wild. In Proc. Multimedia COMMONS.
 [Jelinek et al. 1992] Jelinek, F., Lafferty, J. D., and Mercer, R. L. 1992. Basic methods of probabilistic context free grammars. In Speech Recognition and Understanding. Springer, 345–360.
 [Johnson and Hebert 1999] Johnson, A. E., and Hebert, M. 1999. Using spin images for efficient object recognition in cluttered 3d scenes. IEEE TPAMI 21, 5, 433–449.
 [Kalogerakis et al. 2010] Kalogerakis, E., Hertzmann, A., and Singh, K. 2010. Learning 3d mesh segmentation and labeling. ACM Transactions on Graphics (TOG) 29, 4, 102.
 [Kim et al. 2013] Kim, V. G., Li, W., Mitra, N. J., Chaudhuri, S., DiVerdi, S., and Funkhouser, T. 2013. Learning partbased templates from large collections of 3d shapes. ACM Transactions on Graphics (TOG) 32, 4, 70.
 [Kingma and Ba 2015] Kingma, D. P., and Ba, J. L. 2015. Adam: A method for stochastic optimization. In Proc. ICLR.
 [Li et al. 2016] Li, X., Uricchio, T., Ballan, L., Bertini, M., Snoek, C. G. M., and Bimbo, A. D. 2016. Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval. ACM Comput. Surv. 49, 1.
 [Liu et al. 2014] Liu, T., Chaudhuri, S., Kim, V. G., Huang, Q.X., Mitra, N. J., and Funkhouser, T. 2014. Creating Consistent Scene Graphs Using a Probabilistic Grammar. SIGGRAPH Asia 33, 6.
 [Mitra et al. 2013] Mitra, N. J., Wand, M., Zhang, H., Cohen-Or, D., and Bokeloh, M. 2013. Structure-aware shape processing. In Eurographics STARs, 175–197.
 [Neal and Hinton 1998] Neal, R. M., and Hinton, G. E. 1998. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models. Springer, 355–368.
 [Ordonez et al. 2011] Ordonez, V., Kulkarni, G., and Berg, T. L. 2011. Im2text: Describing images using 1 million captioned photographs. In Proc. NIPS.
 [Osada et al. 2002] Osada, R., Funkhouser, T., Chazelle, B., and Dobkin, D. 2002. Shape distributions. ACM Transactions on Graphics.
 [Porter 1980] Porter, M. F. 1980. An algorithm for suffix stripping. Program 14, 3, 130–137.
 [Sidi et al. 2011] Sidi, O., van Kaick, O., Kleiman, Y., Zhang, H., and Cohen-Or, D. 2011. Unsupervised co-segmentation of a set of shapes via descriptor-space spectral clustering. ACM SIGGRAPH Asia 30, 6, 126:1–126:9.
 [Talton et al. 2012] Talton, J., Yang, L., Kumar, R., Lim, M., Goodman, N., and Měch, R. 2012. Learning design patterns with bayesian grammar induction. In UIST.
 [Tighe and Lazebnik 2011] Tighe, J., and Lazebnik, S. 2011. Understanding scenes on many levels. In Proc. ICCV.
 [Torresani 2016] Torresani, L. 2016. Weaklysupervised learning. In Computer Vision: A Reference Guide, K. Ikeuchi, Ed.
 [van Kaick et al. 2013] van Kaick, O., Xu, K., Zhang, H., Wang, Y., Sun, S., Shamir, A., and Cohen-Or, D. 2013. Co-hierarchical analysis of shape structures. ACM Transactions on Graphics (TOG) 32, 4, 69.
 [Wang et al. 2011] Wang, Y., Xu, K., Li, J., Zhang, H., Shamir, A., Liu, L., Cheng, Z., and Xiong, Y. 2011. Symmetry Hierarchy of ManMade Objects. Eurographics 30, 2.
 [Wang et al. 2012] Wang, Y., Asafi, S., van Kaick, O., Zhang, H., Cohen-Or, D., and Chen, B. 2012. Active co-analysis of a set of shapes. SIGGRAPH Asia.
 [Xie et al. 2014] Xie, Z., Xu, K., Liu, L., and Xiong, Y. 2014. 3d shape segmentation and labeling via extreme learning machine. SGP.
 [Xu et al. 2016] Xu, K., Kim, V. G., Huang, Q., Mitra, N. J., and Kalogerakis, E. 2016. Datadriven shape analysis and processing. SIGGRAPH Asia Course.
 [Yi et al. 2016] Yi, L., Kim, V. G., Ceylan, D., Shen, I., Yan, M., Su, H., Lu, C., Huang, Q., Sheffer, A., and Guibas, L. 2016. A scalable active framework for region annotation in 3d shape collections. TOG 35, 6, 210.
 [Yumer et al. 2014] Yumer, M. E., Chun, W., and Makadia, A. 2014. Co-segmentation of textured 3D shapes with sparse annotations. In Proc. CVPR, 240–247.
 [Zhou and Jacobson 2016] Zhou, Q., and Jacobson, A. 2016. Thingi10K: A dataset of 10,000 3D-printing models. arXiv:1605.04797.
Appendix A Part Features and Embedding Network
We compute per-part geometric features which are further used for joint part embedding and clustering (Section 4). The feature vector includes a 3-view light field descriptor [Chen et al. 2003] (with HOG features for each view), center of mass, bounding box diameter, approximate surface area (fraction of voxels occupied in a 30×30×30 object grid), and a local frame in the PCA coordinate system (represented by a matrix). To mitigate reflection ambiguities in the local frame, we constrain all frame axes to have a positive dot product with the up axis of the global frame. For the light field descriptor we normalize the part to be centered at the origin with bounding box diameter 1; for all other descriptors we normalize the mesh in the same way. The neural network embedding is visualized in Figure 11, and Table 4 lists the embedding network parameters; we widen the first few fully-connected layers to allocate more neurons to richer features such as LFD.
feature  fc1  fc2  fc3  concat  fc4  fc5  fc6 

LFD  128  256  256  512  256  128  64 
PCA Frame  16  32  64  
CoM  16  64  64  
Diameter  8  32  64  
Area  8  32  64 
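The part normalization and PCA-frame sign disambiguation described in Appendix A can be sketched with NumPy. This assumes the part is given as sampled points and that the global up axis is +y (an assumption, since the text does not reproduce the axis symbol):

```python
import numpy as np

def normalize_part(points, up=np.array([0.0, 1.0, 0.0])):
    """Center a part at its center of mass, scale its bounding-box
    diagonal to 1, and build a PCA frame whose axes all have a
    non-negative dot product with the global up axis."""
    pts = points - points.mean(axis=0)                  # center of mass at origin
    diag = np.linalg.norm(pts.max(axis=0) - pts.min(axis=0))
    pts = pts / diag                                    # bbox diameter = 1
    _, _, vt = np.linalg.svd(pts, full_matrices=False)  # rows are PCA axes
    frame = np.where((vt @ up)[:, None] < 0, -vt, vt)   # flip inverted axes
    return pts, frame

rng = np.random.default_rng(0)
pts, frame = normalize_part(rng.normal(size=(100, 3)))
```

Flipping each PCA axis toward the up direction removes the sign ambiguity of eigenvectors, so mirrored parts receive comparable local frames.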
Appendix B Face Features and Classifier Network
We compute per-face geometric features which are further used for hierarchical mesh segmentation (Section 5). These features include spin images (SI) [Johnson and Hebert 1999], shape context (SC) [Belongie et al. 2002], distance distribution (DD) [Osada et al. 2002], local PCA features (LPCA) derived from the eigenvalues of the local coordinate system, local point position variance (LVar), curvature, point position (PP), and point normal (PN). To compute the local radius for feature computation, we sample 10,000 points on the entire shape and use the 50 nearest neighbors. We use the same architecture as the part embedding network (Fig. 11) for face classification, but with a different loss function (Eq. 13) and different network parameters, which are summarized in Table 5.

feature  fc1  fc2  fc3  concat  fc4  fc5  fc6 

Curvature  32  64  64  640  256  128  128 
LPCA  64  64  64  
LVar  32  64  64  
SI  128  128  128  
SC  128  128  128  
DD  32  64  64  
PP  16  32  64  
PN  16  32  64 
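The local-radius computation described in Appendix B can be sketched as follows; a brute-force NumPy version with fewer sample points than the 10,000 used in the paper:

```python
import numpy as np

def local_radius(points, k=50):
    """Per-point local radius: distance to the k-th nearest neighbor,
    via a brute-force pairwise distance matrix (fine for small n)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d.sort(axis=1)   # column 0 is the point itself (distance 0)
    return d[:, k]   # k-th neighbor, excluding the point itself

rng = np.random.default_rng(1)
samples = rng.uniform(size=(500, 3))  # stand-in for 10,000 surface samples
radii = local_radius(samples, k=50)
```

For the full 10,000-point sampling, a k-d tree would avoid the quadratic memory of the pairwise matrix; the brute-force form is kept here for clarity.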
Appendix C E-step Update
In the E-step, the assignment probabilities are iteratively updated: for each node, the probability that it is assigned to a given label is recomputed from the current label distributions of the node's children and of its parent (Eqs. 18–19). A joint closed-form update to all assignments could be computed using Belief Propagation, but we did not try this.
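Since Eqs. 18–19 are not reproduced here, the following is only a generic illustration of such a tree-structured update, not the paper's exact formula: each node's label distribution is refreshed from a unary term weighted by label-compatibility messages from its parent and children, then renormalized. All node names, labels, and compatibility values are hypothetical:

```python
def e_step_update(p, unary, parent, children, compat):
    """One sweep of a mean-field-style label update on a tree:
    each node's distribution is its unary term times compatibility-
    weighted expectations over its parent's and children's labels,
    renormalized over labels. Generic illustration only."""
    labels = list(next(iter(unary.values())))
    new_p = {}
    for n in p:
        scores = {}
        for l in labels:
            s = unary[n][l]
            if parent[n] is not None:  # message from the parent node
                s *= sum(compat[(lp, l)] * p[parent[n]][lp] for lp in labels)
            for c in children[n]:      # messages from child nodes
                s *= sum(compat[(l, lc)] * p[c][lc] for lc in labels)
            scores[l] = s
        z = sum(scores.values())
        new_p[n] = {l: v / z for l, v in scores.items()}
    return new_p

# Two nodes, two labels; compat is keyed (parent_label, child_label).
labels = ["car", "wheel"]
unary = {"root": {"car": 0.9, "wheel": 0.1},
         "child": {"car": 0.5, "wheel": 0.5}}
p0 = {n: {l: 0.5 for l in labels} for n in unary}
parent = {"root": None, "child": "root"}
children = {"root": ["child"], "child": []}
compat = {("car", "car"): 0.2, ("car", "wheel"): 0.8,
          ("wheel", "car"): 0.5, ("wheel", "wheel"): 0.5}
p1 = e_step_update(p0, unary, parent, children, compat)
```

After one sweep the child, whose unary term is uninformative, is pulled toward "wheel" by the parent's strong "car" belief and the (car, wheel) compatibility.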