The ability to decompose a 3D shape into semantic parts can enable applications ranging from shape retrieval in online repositories to robotic manipulation and shape generation. Yet automatic techniques for shape segmentation are limited by the availability of labeled training data, which is often expensive or time-consuming to collect. Unlike images, user-generated 3D shapes in online repositories, such as the 3D Warehouse repository, come with rich metadata. This metadata includes information about geometric primitives (e.g., polygons in 3D meshes) organized in groups, often arranged in a hierarchy, as well as color, material and tags assigned to them. This information reflects the modeling decisions of the designer, which are likely correlated with high-level semantics.
Despite its abundance, the use of metadata for learning shape representations has been relatively unexplored in the literature. One barrier is its high degree of variability. These models were created by designers with a diverse set of goals and a wide range of expertise. As a result, the groups and hierarchies over parts of a shape, which reflect the modeling steps taken by the designer, are highly variable: two similar shapes can have significantly different numbers of parts and different numbers of levels in the part hierarchy. Moreover, tags are rarely assigned to parts and are often arbitrarily named. Figures 1 and 2 illustrate this variability.
Our work systematically addresses these challenges and presents an approach to exploit the information present in the metadata to improve the performance of a state-of-the-art 3D semantic segmentation model. Our approach, illustrated in Figure 1, consists of a deep network that maps each point in a 3D shape to a fixed dimensional embedding. The network is trained in a way such that the embedding reflects the user-provided hierarchy and tags. We propose a robust tree-aware metric to supervise the point embedding network that offers better generalization to semantic segmentation tasks over a baseline scheme that is tree-agnostic (only considers the leaf-level groupings). The point embedding network trained on hierarchies also improves over models trained on shape reconstruction tasks that leverage the 3D shape geometry but not their metadata. Finally, when tags are available we show that the embeddings can be fine-tuned leading to further improvements in performance.
On the ShapeNet semantic segmentation dataset, an embedding network pre-trained on hierarchy metadata outperforms a network trained from scratch by reducing relative error by across 16 categories, when trained on shapes per category. Similarly, when only a small fraction of points (20 points) per shape are labeled, the relative reduction in error is . Furthermore, on 5 categories which have sufficient tags, using both the hierarchy and tags reduces error further by points relative to the randomly initialized network, when trained on shapes per category. Our visualizations indicate that the trained networks implicitly learn correspondences across shapes.
2 Related Work
Our work builds on the advances in deep learning architectures for point-based, or local, shape representations and metric learning approaches to guide representation learning. We briefly review relevant work in these areas.
Supervised learning of local shape descriptors.
Several architectures have been proposed to output local representations, or descriptors, for 3D shape points or patches. The architectures can be broadly categorized according to the type of raw 3D shape representation they consume. Volumetric methods learn local patch representations by processing voxel neighborhoods in uniform grids or in adaptively subdivided grids [19, 12, 24, 25]. View or multi-view approaches learn local image-based representations by processing local 2D shape projections [7, 22], which can be mapped back onto the 3D shape. Finally, a large number of architectures have been recently proposed for processing raw point clouds. PointNet and PointNet++ transform individual point coordinates, and optionally normals, through MLPs and then perform permutation-invariant pooling operations in local neighborhoods [17, 18].
All the above-mentioned deep architectures are trained in a fully supervised manner using a significant amount of labeled data. Although for some specific classes, like human bodies, these annotations can be easily obtained through template-based matching or synthetically generated shapes [1, 2, 3], for the vast majority of shapes in online repositories, gathering such annotations often requires laborious user interaction [14, 30]. Active learning methods have also been proposed to decrease the workload, but still rely on expensive crowdsourcing.
Weak supervision for learning shape descriptors.
A few methods [15, 31] have been recently proposed to avoid expensive point-based annotations. Muralikrishnan et al. extract point-wise representations by training an architecture designed to predict shape-level tags (e.g., armrest chair) by first predicting intermediate shape segmentations. Instead of using weak supervision in the form of shape-level tags, we use unlabeled part hierarchies available in massive online repositories and tags for parts (not whole shapes) when these are available. Yi et al. embed pre-segmented parts in descriptor space by jointly learning a metric for clustering parts, assigning tags to them, and building a consistent part hierarchy. In contrast, our architecture learns point-wise descriptors and relaxes the requirement of inferring consistent hierarchies, which might be hard to estimate for shape families with significant structural variability. Non-rigid geometric alignment has been used as a form of weak and noisy supervision by extracting approximate local shape correspondences between pairs of shapes of similar structure or by deforming part templates. However, global shape alignment can fail for shapes with different structure, while part-based alignment requires corresponding parts or the definition of part templates in the first place. In concurrent work, given a collection of shapes from a single category, Chen et al. proposed a branched autoencoder that discovers coarse segmentations of shapes by predicting implicit fields for each part. Their network is trained with a few manually selected labeled shapes in a few-shot semantic segmentation setting. Our method instead utilizes part hierarchies and metadata as a weak supervisory signal, and we randomly select the labeled sets for our few-shot experiments. In general, our method is complementary to all the above-mentioned weak supervision methods. Our weak signal in the form of unlabeled part hierarchies and part tags can be used in conjunction with geometric alignment, consistent hierarchies, or shape-level tags, whenever these can be obtained.
Triplet-based metric learning.
Our approach learns a metric embedding over points that reflects the hierarchies in 3D shape collections. Metric learning has a rich literature with diverse applications and techniques. A popular approach is to supervise the learning with “triplets” of the form $(a, b, c)$ to denote that “$a$ is more similar to $b$ than to $c$”. This can be written as $d(a, b) < d(a, c)$, where $d(x, y)$ denotes the distance between $x$ and $y$. The distance itself could be computed as the Euclidean distance in some embedding space, i.e., $d(x, y) = \|f(x) - f(y)\|_2$, with the embedding $f$ possibly computed by a deep network. Within this framework, techniques to sample triplets remain an active area of research. These include hard-negative mining, semi-hard negative mining and distance-weighted sampling to bias the sampling of triplets.
3 Mining Metadata from Shape Repositories
We first describe the source of the part hierarchy dataset that we use for training our embedding network. Then we describe the metadata (tags) present in the 3D models and how we extract this information into a consistent dataset.
Several 3D modeling tools, such as SketchUp, Maya and 3DS Max, allow users to model shapes, and scenes in general, as a collection of geometric entities (e.g., polygons) organized into groups. The groups can be nested and organized in hierarchies. In our part hierarchy dataset, we extract these hierarchies. The shapes in our dataset are a subset of the ShapeNet Core dataset, where we focus on categories from the ShapeNet part-segmentation dataset to allow systematic benchmarking and comparison with prior work. Note that the categories semantic segmentation dataset contains shapes, whereas categories in ShapeNet Core dataset contains shapes. We first retrieved the original files for shapes in the ShapeNet Core dataset provided by 3D Warehouse, which are stored in the popular “COLLADA” format. These files represent 3D models in a hierarchical tree structure. Leaf nodes represent shape geometry, and internal nodes represent groups of geometric primitives, or nested groups. Samples from our dataset are visualized in Figure 2. The number of parts into which a shape is segmented depends on the part hierarchy, as visualized in Figure 2 (bottom left). Models with too few parts (less than ) or too many (more than ) are discarded. This gives us a total of 3D models having part group information, with each model having at least one level of part grouping. We further split the dataset into train (), validation () and test () splits. We ensure that the shapes in the test split of the semantic part-segmentation dataset are not included in the train split of our part hierarchy dataset.
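To make the tree structure concrete, the following is a minimal sketch of reading nested groups from a COLLADA file with Python's standard XML parser. The inline sample document, the helper names (`build_tree`, `count_leaf_parts`, `keep_model`) and the threshold arguments are assumptions made for illustration; they are not the extraction pipeline used for the dataset. Real exports may also reference groups indirectly (e.g., through `instance_node` elements), which this sketch does not resolve.

```python
import xml.etree.ElementTree as ET

# COLLADA 1.4 XML namespace. The document below is a hypothetical toy file.
NS = {"c": "http://www.collada.org/2005/11/COLLADASchema"}

SAMPLE = """<COLLADA xmlns="http://www.collada.org/2005/11/COLLADASchema" version="1.4.1">
  <library_visual_scenes>
    <visual_scene id="scene">
      <node name="chair">
        <node name="seat"><instance_geometry url="#g0"/></node>
        <node name="legs">
          <node name="leg_front"><instance_geometry url="#g1"/></node>
          <node name="leg_back"><instance_geometry url="#g2"/></node>
        </node>
      </node>
    </visual_scene>
  </library_visual_scenes>
</COLLADA>"""


def build_tree(node):
    """Recursively convert a COLLADA <node> element into a nested dict."""
    return {
        "name": node.get("name", ""),
        "geometry": [g.get("url") for g in node.findall("c:instance_geometry", NS)],
        "children": [build_tree(child) for child in node.findall("c:node", NS)],
    }


def count_leaf_parts(tree):
    """Count nodes without child groups, i.e. the finest-level parts."""
    if not tree["children"]:
        return 1
    return sum(count_leaf_parts(child) for child in tree["children"])


def keep_model(tree, min_parts, max_parts):
    """Discard rule from the text; both thresholds are caller-supplied placeholders."""
    return min_parts <= count_leaf_parts(tree) <= max_parts


root = ET.fromstring(SAMPLE)
scene = root.find("c:library_visual_scenes/c:visual_scene", NS)
hierarchy = [build_tree(n) for n in scene.findall("c:node", NS)]
```

Here `hierarchy[0]` is the "chair" group with two children, one of which ("legs") is itself a group, so the toy model has three leaf parts.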
Table 1 (columns): Category | Shapes with tagged parts | Average points tagged
Modeling tools allow users to explicitly assign tags to parts, which are stored in the corresponding file format. Not all designers enter tags for the parts they design. Out of all the models that include part group information in our dataset, we observed that only of the shapes had meaningful tags for at least one part (i.e., tags are sparse). These tags are usually not consistent, e.g., a tag for a wheel part in a car can be “wheel_mesh”. To make things worse, only a few tags have high frequency, e.g., one may encounter wheel, chassis, windows (or synonyms of those) frequently as tags, while most tags are rare, or even non-informative about part types, e.g., “geometry123”.
To extract meaningful tags, we selected the most frequent tags encountered as strings, or sub-strings, stored in the nodes for each shape category. We also merge synonyms into one tag to reduce the number of tags in the final set. For every tag, we find the corresponding geometry nodes and then label the points sampled from these nodes with the tag. We found that only out of categories have a “sufficient” number of tagged points ( of the original surface points). By “sufficient”, we mean that below this threshold, tags become so sparse in a category that they result in negligible improvements. Table 1 shows the distribution of tags in these categories.
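The substring matching and synonym merging described above can be sketched as follows. The vocabulary `VOCAB`, the synonym table `SYNONYMS` and the function names are hypothetical and chosen only for illustration; the actual tag set is selected per category from tag frequencies.

```python
from collections import Counter

# Hypothetical vocabulary and synonym table, for illustration only.
SYNONYMS = {"tire": "wheel", "tyre": "wheel", "windshield": "window"}
VOCAB = ["wheel", "window", "chassis", "tire", "tyre", "windshield"]


def canonical_tag(node_name):
    """Return the canonical tag found as a substring of a node name, or None."""
    name = node_name.lower()
    for word in VOCAB:
        if word in name:
            # Merge synonyms into a single canonical tag.
            return SYNONYMS.get(word, word)
    return None


def frequent_tags(node_names, top_k):
    """Return the top_k most frequent canonical tags over a list of node names."""
    counts = Counter(t for t in map(canonical_tag, node_names) if t is not None)
    return [tag for tag, _ in counts.most_common(top_k)]
```

With this sketch a node named “wheel_mesh” maps to the tag `wheel`, while a non-informative name such as “geometry123” maps to no tag and its points stay untagged.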
We finally aligned the shapes using ICP so that their orientation agrees with the canonical orientation provided for the same shapes in ShapeNet. To process the shapes through our point-based architecture, we uniformly sampled 10K points on their surface. Further details about these steps are provided in the supplementary material.
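As an illustration of uniform surface sampling, the sketch below draws points from a triangle mesh by picking triangles with probability proportional to their area and then sampling uniformly inside each chosen triangle. This is a common scheme and an assumption here; the exact sampling procedure is described in the supplementary material. Returning the face index of each sample makes it possible to transfer the part (or tag) of a geometry node to the points sampled from it.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of a 3D triangle from the cross product of two edge vectors."""
    ux, uy, uz = (b[i] - a[i] for i in range(3))
    vx, vy, vz = (c[i] - a[i] for i in range(3))
    cx, cy, cz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    return 0.5 * math.sqrt(cx * cx + cy * cy + cz * cz)


def sample_surface(vertices, faces, n_points, rng):
    """Sample n_points uniformly on the mesh surface; also return their face ids."""
    tris = [(vertices[i], vertices[j], vertices[k]) for i, j, k in faces]
    areas = [triangle_area(*t) for t in tris]
    # Larger triangles are chosen proportionally more often.
    chosen = rng.choices(range(len(tris)), weights=areas, k=n_points)
    points, face_ids = [], []
    for f in chosen:
        a, b, c = tris[f]
        r1, r2 = rng.random(), rng.random()
        s = math.sqrt(r1)
        # Barycentric weights that are uniform over the triangle.
        wa, wb, wc = 1.0 - s, s * (1.0 - r2), s * r2
        points.append(tuple(wa * a[i] + wb * b[i] + wc * c[i] for i in range(3)))
        face_ids.append(f)
    return points, face_ids
```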
Our Point Embedding Network (PEN) takes as input a shape in the form of a point cloud set, $X = \{x_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^3$ represents the 3D coordinates of each point. Our network learns to map each input shape point $x$ to an embedding $f_w(x) \in \mathbb{R}^d$ based on learned network parameters $w$. The architecture is illustrated in Figure 3. PEN first incorporates a PointNet module: the points in the input shape are individually encoded into vectorial representations through MLPs, then the resulting point-wise representations are aggregated through max pooling to form a global shape representation. The representation is invariant to the order of the points in the input point set. At the next stage, the learned point-wise representations are concatenated with the global shape representation, and are further transformed through fully-connected layers and ReLUs. In this manner, the point embeddings reflect both local and global shape information.
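The data flow described above (point-wise encoding, max pooling into a global feature, concatenation, further point-wise transformation) can be illustrated with a small dependency-free sketch. It is not the PEN implementation: a single random linear layer with ReLU stands in for each MLP stack, all layer sizes are placeholders, and the final normalization to unit length is included only because the embeddings are constrained to a unit hypersphere during training.

```python
import math
import random


def linear_relu(x, weights, bias):
    """One fully-connected layer followed by ReLU, applied to a single vector."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def random_layer(n_in, n_out, rng):
    """Randomly initialised layer; stands in for trained parameters."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, bias


def embed_points(points, enc, dec):
    """Map each 3D point to a unit-length embedding using local and global features."""
    local = [linear_relu(p, *enc) for p in points]       # point-wise encoding
    global_feat = [max(col) for col in zip(*local)]      # order-invariant max pooling
    embeddings = []
    for feat in local:
        h = linear_relu(feat + global_feat, *dec)        # concatenate, then transform
        norm = math.sqrt(sum(v * v for v in h)) or 1.0   # guard against a zero vector
        embeddings.append([v / norm for v in h])
    return embeddings
```

Because the global feature is obtained by max pooling, reordering the input points only reorders the output embeddings in the same way.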
We used PointNet as a module to extract the initial point-wise and global shape representation mainly due to its efficiency. In general, other point-based modules, or even volumetric [13, 24, 19] and view-based modules [21, 7] for local and global shape processing could be adapted in a similar manner within our architecture. Below we describe the main focus of our work to learn the parameters of the architecture based on part hierarchies and tag data.
Learning from part hierarchies.
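Our training takes a standard metric learning approach where the parameters of the PEN are optimized such that pairs of points originating from the same part sampled from the hierarchy (positive pairs) have a smaller distance in the embedded space than pairs of points originating from different parts (negative pairs). Specifically, given a triplet of points $(a, b, c)$, where $a$ and $b$ form a positive pair and $a$ and $c$ form a negative pair, the loss of the network over this triplet is defined as:

$$\ell(a, b, c) = \left[ D_{ab} - D_{ac} + m \right]_+$$

where $D_{ab} = \|f_w(a) - f_w(b)\|_2$ is the distance between the PEN embeddings $f_w(\cdot)$ of the two points, $m$ is a scalar margin, and $[x]_+ = \max(0, x)$. To avoid degenerate solutions we constrain the embeddings to lie on a unit hypersphere, i.e., $\|f_w(x)\|_2 = 1$, $\forall x$. Given a set of triplets $\mathcal{T}_s$ sampled from each shape $s$ from our dataset $\mathcal{S}$, the triplet objective of the PEN is to minimize the triplet loss:

$$L_{triplet}(w) = \sum_{s \in \mathcal{S}} \frac{1}{|\mathcal{T}_s|} \sum_{(a, b, c) \in \mathcal{T}_s} \ell(a, b, c)$$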
One simple strategy to sample triplets is to just access the parts at the finest level of segmentation, then sample triplets by randomly taking fixed number of similar pairs from the same part and an equal number of negative points from another part. We call this strategy “leaf” triplet sampling.
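A minimal sketch of “leaf” triplet sampling together with a margin-based triplet loss is given below. It assumes a hinge on Euclidean distances between embeddings and represents a shape simply as a list of leaf-part ids, one per point; the function names and this data layout are illustrative assumptions rather than the training code.

```python
import math
import random


def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(f_a, f_b, f_c, margin):
    """Zero once the negative pair is farther apart than the positive pair by `margin`."""
    return max(0.0, distance(f_a, f_b) - distance(f_a, f_c) + margin)


def sample_leaf_triplets(part_of_point, n_triplets, rng):
    """Return (anchor, positive, negative) point indices using leaf-level parts only."""
    by_part = {}
    for idx, part in enumerate(part_of_point):
        by_part.setdefault(part, []).append(idx)
    # A part needs at least two points to provide a positive pair.
    usable = [p for p, idxs in by_part.items() if len(idxs) >= 2]
    triplets = []
    if not usable or len(by_part) < 2:
        return triplets
    for _ in range(n_triplets):
        pos_part = rng.choice(usable)
        neg_part = rng.choice([p for p in by_part if p != pos_part])
        a, b = rng.sample(by_part[pos_part], 2)
        c = rng.choice(by_part[neg_part])
        triplets.append((a, b, c))
    return triplets
```

Note that the negative part is chosen uniformly among the other leaf parts, which is what makes this strategy agnostic to the tree structure.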
An alternative strategy is to consider the part hierarchy tree for triplet sampling. Here, we sample negative point pairs depending on the tree distance between the part groups (tree nodes) they belong to. Given two nodes $i$ and $j$, we use the sum of path lengths (number of tree edges) from nodes $i$ and $j$ to their lowest common ancestor as the tree distance $d_T(i, j)$. For example, if the two nodes are siblings (i.e., two parts belonging to the same larger group), then their lowest common ancestor is their parent and their tree distance is equal to $2$ (i.e., we count the two edges that connect them to their parent). If two nodes are further away in the hierarchy, then the tree distance increases. In this manner, the tree distance reflects how far apart two nodes (parts) are in the hierarchy.
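The tree distance can be computed directly from parent pointers, as in the sketch below. The `parent` dictionary (mapping each node to its parent, with `None` for the root) and the function names are assumptions made for illustration.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root, following parent pointers."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def tree_distance(u, v, parent):
    """Number of edges from u and from v to their lowest common ancestor, summed."""
    steps_from_u = {n: d for d, n in enumerate(path_to_root(u, parent))}
    for steps_from_v, n in enumerate(path_to_root(v, parent)):
        if n in steps_from_u:
            # The first ancestor of v that is also an ancestor of u is the LCA.
            return steps_from_u[n] + steps_from_v
    raise ValueError("nodes are not in the same tree")
```

For a toy chair with a "seat" part and a "legs" group containing "leg_front" and "leg_back", the two legs are siblings at distance 2, while "seat" and "leg_front" are at distance 3.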
We compute the probability of selecting the positive pair of points from node $i$ and the negative pair using a point from another node $j$ as follows:
Sampling points in this way more frequently yields triplets whose negative pairs are closer in the hierarchy. Parts that are closer in the hierarchy tend to be spatially or geometrically closer to each other, and thus also harder to discriminate. We call this strategy “hierarchy” triplet sampling. We discuss the effect of these two strategies in the experiments section.
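The following sketch shows one way to bias the choice of the negative part toward parts that are close in the hierarchy. The weighting used here, inversely proportional to the tree distance, is an assumption chosen only because it has the stated qualitative behaviour; it should not be read as the exact sampling probability. The `tree_dist` lookup keyed by unordered part pairs is likewise a hypothetical data layout.

```python
import random


def negative_part_probs(pos_part, parts, tree_dist):
    """Probabilities over candidate negative parts, higher for smaller tree distance."""
    others = [p for p in parts if p != pos_part]
    # Assumed weighting: inverse tree distance, then normalised to sum to one.
    weights = [1.0 / tree_dist[frozenset((pos_part, p))] for p in others]
    total = sum(weights)
    return others, [w / total for w in weights]


def sample_negative_part(pos_part, parts, tree_dist, rng):
    """Draw one negative part for a positive pair taken from `pos_part`."""
    others, probs = negative_part_probs(pos_part, parts, tree_dist)
    return rng.choices(others, weights=probs, k=1)[0]
```

Once the negative part is drawn, the anchor and positive points are taken from the positive part and the negative point from the drawn part, exactly as in leaf-level sampling.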
Learning from noisy tag data.
We can also utilize tag data for segments collected from the COLLADA files, as described in Section 3. To train the network using tags, we add two pointwise fully-connected layers on top of the embedding network (PEN). One way to train this network is to define a categorical cross entropy over points whose parts are tagged. However, as shown in Table 1, the total number of tagged points is small. We instead found that a better strategy is to use a one-vs-rest binary cross entropy loss to also make use of points in un-tagged parts. The reason is that if a part is not tagged in a shape that has other parts labeled with tags existing in the shape metadata, then most likely, that part should not be labeled with any of the existing tags for that shape (e.g., if a car has tagged parts as ‘wheel’ and ‘window’, then other un-tagged parts should most likely not be assigned with these tags).
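More specifically, for every tag $t$ in our tag set $\mathcal{G}$ for a shape category, we define a binary cross entropy loss by considering all points assigned with that tag as ‘positive’ (set $\mathcal{P}_t$), while the rest of the points, assigned with other tags or no tags, are considered ‘negative’ (set $\mathcal{N}_t$). Given an output probability prediction for assigning a point $x$ with tag $t$, denoted as $P(t \mid f_w(x))$ and produced by the last classification layer (sigmoid layer) of our network, our loss function over tags is defined as follows:

$$L_{tag} = -\sum_{t \in \mathcal{G}} \left[ \sum_{x \in \mathcal{P}_t} \log P(t \mid f_w(x)) + \sum_{x \in \mathcal{N}_t} \log \big( 1 - P(t \mid f_w(x)) \big) \right]$$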
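A minimal sketch of a one-vs-rest binary cross entropy over tags is shown below. Each point carries a (possibly empty) set of tags, and untagged points act as negatives for every tag. Averaging over points and tags, the clipping constant `eps`, and the dictionary-based data layout are assumptions made for illustration.

```python
import math


def tag_loss(probs, tags_of_point, tag_set, eps=1e-7):
    """Mean one-vs-rest binary cross entropy.

    probs[i][t] is the sigmoid output for point i and tag t;
    tags_of_point[i] is the set of tags assigned to point i (may be empty).
    """
    total = 0.0
    for t in tag_set:
        for p, tags in zip(probs, tags_of_point):
            q = min(max(p[t], eps), 1.0 - eps)  # clip to keep the logarithm finite
            # Points with tag t are positives; all others, tagged or not, are negatives.
            total -= math.log(q) if t in tags else math.log(1.0 - q)
    return total / (len(probs) * len(tag_set))
```

Because every tag has its own sigmoid output, a point can receive high probability for more than one tag, which accommodates tags that are not mutually exclusive.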
We first train our network to minimize the triplet loss based on our dataset of shapes that contains part hierarchies. Training is done in a cross-category manner on categories of the ShapeNet Core dataset, as described in Section 3 (these are the same categories present in the ShapeNet semantic segmentation dataset from Yi et al.). We use the Adam optimizer with an initial learning rate of decayed by a factor of whenever the triplet loss stops decreasing on the validation set. The mini-batches consist of shapes. For further efficiency, in each iteration we randomly sample a subset of points (from the original points) for each shape during training. The total number of triplets sampled from a shape is kept constant.
Then, for the categories that include tags, we further fine-tune the learned embeddings by learning the two additional pointwise fully-connected layers, with a sigmoid at the end, to minimize the tag loss. Since tags are distinct for each category, fine-tuning is done in a category-specific manner (i.e., we produce a different embedding specialized for each of these categories). Although the triplet and tag losses could be combined, we choose a stage-wise training approach since the shapes with part hierarchies are significantly more numerous than the shapes that include tags, as shown in Table 1. In our experiments we discuss the effect of training only with the triplet loss, and also the effect of fine-tuning with the tag loss in each category.
For training networks on the few-shot learning task, we search over hyper-parameters (batch size, epochs, regularization, etc.) using the validation set of only one category (‘airplane’) and use the same hyper-parameter setting to train all models on all categories.
Given our network pre-trained on our shape datasets based on part hierarchies and/or tags, we can further train it on other, much smaller, datasets of shapes that include semantic part labels. To do this, once again we add two point-wise fully-connected layers on top of the embedding layer, and a softmax layer to produce semantic part label probabilities. In our experiments, we observe that the part labeling performance is significantly increased when compared to training our network from scratch using semantic part labels only as supervision.
In our implementation, the encoder of our network extracts a -dimensional global shape embedding. The decoder concatenates the global embedding with the point features from the encoder, and finally transforms them into -dimensional point-wise embeddings. Further details of the layers used in PEN are discussed in the supplementary material. Our implementation is based on PyTorch.
We now discuss experiments performed to evaluate our method and alternatives. First, we present qualitative analysis of learned embeddings, then we discuss a new benchmark we introduce for few-shot segmentation and evaluation metrics, and finally we present results and comparisons of our network with various baselines.
Visualization of the embeddings.
We first present a qualitative analysis of the PEN embeddings. The embeddings learnt using metric learning only (without the tag loss) are visualized in Figure 4 (left). We use the t-SNE algorithm to embed the -dimensional point embedding in space. Interestingly, the descriptors produced by PEN consistently embed the points belonging to similar parts close to each other without explicit semantic supervision. We also visualize the embeddings predicted by PEN trained with metric learning and fine-tuned with tag loss in Figure 4 (right). The embeddings have better correspondence with the tags. Interestingly, the network predicts correct embeddings for points with tags that are not mutually exclusive e.g. ‘cushion’ and ‘back’ of the chair.
Few-shot Segmentation Benchmark.
We anticipate that learning from metadata can improve semantic shape segmentation tasks, especially in the few-shot learning scenario. To this end we have created a new benchmark on the ShapeNet segmentation dataset, in which we randomly select fully labeled examples from the complete training set for training the network, where . In this manner, we can test the behaviour of methods with an increasing number of training shapes, starting with the few-shot scenario where only a handful of shapes (i.e., 4 or 8) are labeled. The performance is measured as the mean intersection over union (mIOU) across all part labels and shapes in the test splits. We exclude the shapes existing in our part hierarchy and tag datasets used for pre-training PEN from the test splits.
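For reference, a per-shape mean intersection over union can be computed as in the sketch below; the benchmark number is then an average of such values over test shapes. Counting a part that is absent from both the prediction and the ground truth as IoU 1 is a convention commonly used for part segmentation benchmarks and is an assumption here, as is the function name.

```python
def shape_miou(pred, gt, part_labels):
    """Mean IoU over the part labels of one shape, given per-point label lists."""
    ious = []
    for label in part_labels:
        inter = sum(1 for p, g in zip(pred, gt) if p == label and g == label)
        union = sum(1 for p, g in zip(pred, gt) if p == label or g == label)
        # Assumed convention: a part missing from both counts as a perfect match.
        ious.append(1.0 if union == 0 else inter / union)
    return sum(ious) / len(ious)
```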
We also introduce one more evaluation setting, where for each shape category, the training shapes have smaller fractions of their original points labeled (, … labeled points compared to the original points). The case of - labeled points simulates the scenario where semantic annotations are collected through sparse user input (e.g., clicking a few points on shapes and labeling them).
Since we utilize a large amount of unlabeled data from the same domain, it is important to compare with baselines. Our first baseline simply trains PEN from scratch on the training splits of our few-shot segmentation benchmark using only semantic label supervision (without using metadata). As a second baseline, we train an autoencoder network that leverages only geometry as an alternative way to produce point-wise embeddings. This network first encodes the input point cloud to point-wise embeddings, producing -dimensional point-wise representations exactly as in PEN, then a decoder uses upconvolution to reconstruct the original point cloud. The Chamfer distance between generated points and input points is used as a loss function to train this network. We first pre-train the autoencoder on the shapes included in our part hierarchy dataset. After this pre-training step, we retain the encoder, replace the geometry decoder with PEN’s decoder, and add two pointwise fully connected layers and a classification layer to produce semantic part label probabilities. The resulting network is then trained in stages, first the decoder and then the entire network at a smaller learning rate, on the training splits of our few-shot segmentation benchmark.
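A brute-force sketch of a Chamfer distance between two point sets is given below. Several variants exist; this one uses squared nearest-neighbour distances averaged in both directions, which is an assumption and may differ from the exact form used to train the autoencoder baseline.

```python
def chamfer_distance(xs, ys):
    """Symmetric Chamfer distance between two point sets (O(len(xs) * len(ys)))."""
    def squared(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def one_way(src, dst):
        # Average squared distance from each source point to its nearest target point.
        return sum(min(squared(p, q) for q in dst) for p in src) / len(src)

    return one_way(xs, ys) + one_way(ys, xs)
```

The loss is zero when the two sets coincide and does not depend on the order of the points in either set.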
Finally, we also evaluate the two strategies for pre-training the embedding network, which differ in their triplet sampling technique: one based on leaf-level shape parts (“leaf” triplet sampling) and one based on the hierarchy tree (“hierarchy” triplet sampling), as described in Section 4.
Next, we compare the performance of our method with the baselines and different sampling strategies under the scenario of using only the triplet loss and cross-category training. Then, we discuss the performance in the case where we additionally use the tag loss.
Few-shot Segmentation Evaluation.
In Figure 5 (left), we plot the mIOU of the baselines along with our method. The plotted mIOU is obtained by taking the average of the mIOU on our test splits over all categories and repeating each experiment times. The network trained from scratch (without any pre-training) has the worst performance. The network based on the pre-trained autoencoder shows some improvement since its point-wise representations reflect local and global geometric structure for the point cloud reconstruction, which can also be relevant to the segmentation task. Our method consistently outperforms the baselines. In particular, the “hierarchy” triplet sampling, which uses the part hierarchy trees to choose triplets for training our network, performs the best on average. A mIOU improvement ( drop in relative error) is observed compared to training from scratch at training examples; interestingly, the improvement is retained even for training examples. The “hierarchy” triplet sampling also improves over the “leaf” triplet sampling until training examples, after which the gap between the two strategies closes.
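The relation between an mIOU improvement and a drop in relative error can be made explicit with a short sketch. It assumes that the error is defined as one minus the mIOU (expressed as a fraction), which is an interpretation rather than a definition stated in the text; the function name is hypothetical.

```python
def relative_error_drop(miou_baseline, miou_pretrained):
    """Fraction by which the error (1 - mIOU) shrinks relative to the baseline."""
    err_base = 1.0 - miou_baseline
    err_new = 1.0 - miou_pretrained
    if err_base <= 0.0:
        raise ValueError("baseline error must be positive")
    return (err_base - err_new) / err_base
```

Under this assumption, the same absolute mIOU gain corresponds to a larger relative error drop when the baseline is already accurate, which is why both quantities are informative.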
Evaluating with limited labeled points per shape.
In the previous section we observed the performance of our method and the baselines when changing the number of training shapes. Here we examine the performance in the few-shot setting where we keep the number of training shapes fixed and vary the number of labeled points per training shape. We retrain the above baselines (training from scratch, autoencoder) and triplet sampling strategies (“leaf” and “hierarchy”) with training examples, and vary the number of labeled points as shown in Figure 5 (right). Again, our network using the “hierarchy” triplet sampling performs better than the baselines. It offers better mIOU ( drop in relative error) compared to training from scratch using labeled points.
Are tags useful?
Here we repeat the two few-shot segmentation tasks on 5 shape categories (motorcycle, airplane, table, chair, car) that include some tagged parts in their shape metadata. We examine two more PEN variants: (a) PEN pre-trained using the tag loss only (no triplet loss), then fine-tuned on the training splits of our semantic segmentation benchmark (this variant is simply called the “Tags” network), and (b) our network pre-trained using the triplet loss based on the “hierarchy” sampling, then fine-tuned with the tag loss, and finally further fine-tuned on the training splits of our semantic segmentation benchmark (this variant is called the “Hierarchy+Tags” network). The two PEN variants are trained separately for each of the categories. The results are shown in Figure 6.
When using training examples, the Hierarchy+Tags network offers better mIOU ( drop in relative error) on average compared to training from scratch in these 5 categories (see Figure 6 (left)). An improvement of mIOU ( drop in relative error) is maintained for training examples. Similarly, when using labeled points per shape, Hierarchy+Tags performs mIOU better ( drop in relative error) than training from scratch (see Figure 6 (right)). In general, the Hierarchy+Tags PEN variant outperforms all other baselines (training from scratch, autoencoder) and also the variant pre-trained using tags only (“Tags” network) in both evaluation settings, i.e., with a limited number of training shapes and with a limited number of labeled points. This shows that the combination of pre-training through metric learning on part hierarchies and fine-tuning using tags results in a better warm-start model for the semantic segmentation task.
We presented a method to exploit existing part hierarchies and tag metadata associated with 3D shapes found in online repositories to pre-train deep networks for shape segmentation. The trained network can be used to “warm start” a model for semantic shape segmentation, improving performance in the few-shot setting. Future directions include investigating alternative architectures and combining other types of metadata, such as geometric alignment or material information.
Acknowledgements. This research is funded by NSF (CHS-161733 and IIS-1749833). Our experiments were performed in the UMass GPU cluster obtained under the Collaborative Fund managed by the Massachusetts Technology Collaborative.
[1] (2003) The space of human body shapes: reconstruction and parameterization from range scans. ACM Trans. Graph. 22 (3).
[2] (2005) SCAPE: shape completion and animation of people. ACM Trans. Graph. 24 (3).
[3] (2014-06) FAUST: dataset and evaluation for 3D mesh registration. In , Piscataway, NJ, USA.
[4] (2019) BAE-NET: branched autoencoder for shape co-segmentation. CoRR abs/1903.11228.
[5] (2017) In defense of the triplet loss for person re-identification. CoRR abs/1703.07737.
[6] (2014) Deep metric learning using triplet network. CoRR abs/1412.6622.
[7] (2018) Learning local shape descriptors from part correspondences with multiview convolutional networks. ACM Trans. Graph. 37 (1).
[8] (2015) Analysis and synthesis of 3D shape families via deep-learned generative models of surfaces. Computer Graphics Forum 34 (5).
[9] (2013) Fine-grained semi-supervised labeling of large shape collections. ACM Trans. Graph. 32 (6).
[10] (2017) 3D shape segmentation with projective convolutional networks. In Proc. CVPR.
[11] (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
[12] (2017) Escape from cells: deep Kd-Networks for the recognition of 3D point cloud models. In Proc. ICCV.
[13] 3D convolutional neural networks for landing zone detection from LiDAR. In Proc. ICRA.
[14] (2019) PartNet: a large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding.
[15] (2018) Tags2Parts: discovering semantic regions from shape tags. In Proc. CVPR.
[16] PyTorch. Note: https://pytorch.org
[17] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In Proc. CVPR.
[18] (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Proc. NIPS.
[19] (2017) OctNet: learning deep 3D representations at high resolutions. In Proc. CVPR.
[20] FaceNet: a unified embedding for face recognition and clustering. CoRR abs/1503.03832.
[21] (2015) Multi-view convolutional neural networks for 3D shape recognition. In Proc. ICCV.
[22] (2018) Tangent convolutions for dense prediction in 3D. In Proc. CVPR.
[23] Trimble 3D Warehouse. Note: https://3dwarehouse.sketchup.com/
[24] (2017) O-CNN: octree-based convolutional neural networks for 3D shape analysis. ACM Trans. Graph. 36 (4).
[25] (2018) Adaptive O-CNN: a patch-based deep representation of 3D shapes. ACM Trans. Graph. 37 (6).
[26] (2019) COLLADA. Note: [Online; accessed 5-August-2019].
[27] (2019) Lowest common ancestor. Note: [Online; accessed 22-March-2019].
[28] (2017-06) Sampling matters in deep embedding learning. ArXiv e-prints.
[29] (2017-07) Learning hierarchical shape segmentation and labeling from online repositories. ACM Trans. Graph. 36.
[30] (2016) A scalable active framework for region annotation in 3D shape collections. ACM Trans. Graph. 35 (6).
[31] (2019) CoSegNet: deep co-segmentation of 3D shapes with group consistency loss. CoRR abs/1903.10297.