PartNet: A Large-scale Benchmark for Fine-grained and Hierarchical Part-level 3D Object Understanding

12/06/2018 ∙ by Kaichun Mo, et al.

We present PartNet: a consistent, large-scale dataset of 3D objects annotated with fine-grained, instance-level, and hierarchical 3D part information. Our dataset consists of 573,585 part instances over 26,671 3D models covering 24 object categories. This dataset enables and serves as a catalyst for many tasks such as shape analysis, dynamic 3D scene modeling and simulation, affordance analysis, and others. Using our dataset, we establish three benchmarking tasks for evaluating 3D part recognition: fine-grained semantic segmentation, hierarchical semantic segmentation, and instance segmentation. We benchmark four state-of-the-art 3D deep learning algorithms for fine-grained semantic segmentation and three baseline methods for hierarchical semantic segmentation. We also propose a novel method for part instance segmentation and demonstrate its superior performance over existing methods.


Code Repositories

PointCNN: PointCNN: Convolution On X-Transformed Points (NeurIPS 2018)

partnet_dataset: PartNet Dataset Official Release Repo

partnet_anno_system: PartNet 3D Web-based Shape Parts Annotation System

1 Introduction

Being able to parse objects into parts is critical for humans to understand and interact with the world. People recognize, categorize, and organize objects based on the knowledge of their parts [8]. Many actions that people take in the real world require detection of parts and reasoning over parts. For instance, we open doors using doorknobs and pull out drawers by grasping their handles. Teaching machines to analyze parts is thus essential for many vision, graphics, and robotics applications, such as predicting object functionality [11, 12], human-object interactions [16], simulation [18], shape editing [27, 14], and shape generation [23, 39].

To enable part-level object understanding by learning approaches, 3D data with part annotations are in high demand. Many cutting-edge learning algorithms, especially for 3D understanding [43, 42, 29], intuitive physics [25], and reinforcement learning [45, 28], require such data to train networks and benchmark performance. Researchers are also increasingly interested in synthesizing dynamic data through physical simulation engines [18, 41, 28]. Creating large-scale animatable scenes will require a large amount of 3D data with affordance and mobility information, and object parts serve as a critical stepping stone to this information. Thus a large 3D object dataset with part annotations is needed.

Figure 1: PartNet dataset and three benchmarking tasks. Left: we show example annotations at three levels of segmentation in the hierarchy. Right: we propose three fundamental and challenging segmentation tasks and establish benchmarks using PartNet.

 

Dataset | #Shape | #Part | #Category | Granularity | Semantics | Hierarchical | Instance-level | Consistent
Chen et al. [5] | 380 | 4,300 | 19 | Fine-grained | No | No | Yes | No
MCL [35] | 1,016 | 7,537 | 10 | Fine-grained | Yes | No | No | Yes
Chang et al. [4] | 2,278 | 27,477 | 90 | Fine-grained | Yes | No | Yes | No
Yi et al. [43] | 31,963 | 80,323 | 16 | Coarse | Yes | No | No | Yes
PartNet (ours) | 26,671 | 573,585 | 24 | Fine-grained | Yes | Yes | Yes | Yes

Table 1: Comparison to other shape part datasets.

With the availability of existing 3D shape datasets with part annotations [5, 3, 43], we have witnessed increasing research interest and advances in 3D part-level object understanding. Recently, a variety of learning methods have been proposed to push the state of the art in 3D shape segmentation [29, 30, 44, 17, 33, 22, 7, 37, 38, 40, 31, 6, 24, 21]. However, existing datasets either provide part annotations on relatively small numbers of object instances [5] or provide only coarse, non-hierarchical part annotations [43], restricting applications that involve understanding fine-grained and hierarchical shape segmentation.

In this paper, we introduce PartNet: a consistent, large-scale dataset built on top of ShapeNet [3] with fine-grained, hierarchical, instance-level 3D part information. Collecting such fine-grained and hierarchical segmentation is challenging: the boundaries between fine-grained part concepts are more obscure than those between coarse parts. We therefore define a common set of part concepts by carefully examining the 3D objects to annotate, balancing several criteria: well-defined, consistent, compact, hierarchical, atomic and complete. Shape segmentation involves multiple levels of granularity: coarse parts describe more global semantics and fine-grained parts convey richer geometric and semantic details. We organize expert-defined part concepts into hierarchical segmentation templates to guide annotation.

PartNet provides a large-scale benchmark for many part-level object understanding tasks. In this paper, we focus on three fundamental and challenging shape segmentation tasks: fine-grained semantic segmentation, hierarchical semantic segmentation, and instance segmentation. We benchmark four state-of-the-art algorithms on fine-grained semantic segmentation and propose three baseline methods for hierarchical semantic segmentation. We also propose the task of part instance segmentation using PartNet and, by taking advantage of rich shape structure, introduce a novel method that outperforms the existing baseline algorithm by a clear margin.

PartNet contains highly structured, fine-grained and heterogeneous parts. Our experiments reveal that existing algorithms developed for coarse and homogeneous part understanding do not work well on PartNet. First, small and fine-grained parts, e.g. door handles and keyboard buttons, are abundant and present new challenges for part recognition. Second, many geometrically similar but semantically different parts require more global shape context to distinguish. Third, understanding the heterogeneous variation of shapes and parts necessitates hierarchical understanding. We expect PartNet to serve as a better platform for part-level object understanding in the years to come.

In summary, we make the following contributions:

  • We introduce PartNet, consisting of 573,585 fine-grained part annotations for 26,671 shapes across 24 object categories. To the best of our knowledge, it is the first large-scale dataset with fine-grained, hierarchical, instance-level part annotations;

  • We propose three part-level object understanding tasks to demonstrate the usefulness of this data: fine-grained semantic segmentation, hierarchical semantic segmentation, and instance segmentation.

  • We benchmark four state-of-the-art algorithms for semantic segmentation and three baseline methods for hierarchical segmentation using PartNet;

  • We propose the task of part instance segmentation on PartNet and describe a novel method that outperforms the existing baseline method by a large margin.

Figure 2: PartNet dataset. We visualize example shapes with fine-grained part annotations for the 24 object categories in PartNet.

 

All Bag Bed Bott Bowl Chair Clock Dish Disp Door Ear Fauc Hat Key Knife Lamp Lap Micro Mug Frid Scis Stora Table Trash Vase

 

#A 32537 186 248 519 247 8176 624 241 1005 285 285 840 287 210 514 3408 485 268 252 247 127 2639 9906 378 1160
#S 26671 146 212 464 208 6400 579 201 954 245 247 708 250 174 384 2271 453 212 212 207 88 2303 8309 340 1104
#M 771 20 18 28 20 77 25 20 26 20 19 60 19 18 57 64 20 28 20 20 20 34 91 19 28
#PS 480 4 24 12 4 57 23 12 8 8 15 18 8 3 16 83 8 12 4 13 5 36 82 15 10
#PI 573K 664 9K 2K 615 176K 4K 2K 7K 2K 3K 8K 1K 20K 3K 50K 3K 2K 839 2K 981 77K 177K 8K 5K
Pmed 14 4 33 5 2 19 5 9 8 7 12 9 4 106 7 12 8 7 3 9 8 24 15 9 4
Pmax 230 7 169 7 4 153 32 16 12 20 14 34 5 127 10 230 8 17 6 33 9 220 214 143 200
Dmed 3 1 5 2 1 3 3 3 3 3 3 3 2 1 3 5 2 3 1 3 2 4 4 2 2
Dmax 7 1 5 2 1 5 4 3 3 3 3 3 2 1 3 7 2 3 1 3 2 5 6 2 3

 

Table 2: PartNet statistics. Rows #A, #S and #M respectively show the number of shape annotations, the number of distinct shape instances, and the number of shapes for which we collect multiple annotations. Rows #PS and #PI show the number of different part semantics and part instances that we finally collect. Rows Pmed and Pmax respectively indicate the median and maximum number of part instances per shape. Rows Dmed and Dmax respectively indicate the median and maximum hierarchy depth per shape, with the root node at depth 0.

2 Related Work

Understanding shape parts is a long-standing problem in computer vision and graphics. Lacking large-scale annotated datasets, early research efforts evaluated algorithm results qualitatively and conducted quantitative comparisons on small sets of 3D models. Attene et al. [1] compared 5 mesh segmentation algorithms using 11 3D surface meshes and presented side-by-side qualitative comparisons. Chen et al. [5] collected 380 surface meshes from 19 object categories with instance-level part decompositions for each shape and proposed quantitative evaluation metrics for shape segmentation. Concurrently, Benhabiles et al. [2] proposed similar evaluation criteria and methodology. Kalogerakis et al. [15] further assigned semantic labels to the segmented components. Shape co-segmentation benchmarks [36, 9] were proposed to study co-segmentation among similar shapes.

Recent advances in deep learning have demonstrated the power and efficiency of data-driven methods on 3D shape understanding tasks such as classification, segmentation and generation. ShapeNet [3] collected a large-scale set of synthetic 3D CAD models from online open-sourced 3D repositories, including more than 3,000,000 models and 3,135 object categories. Yi et al. [43] took an active learning approach to annotate ShapeNet models with semantic segmentation for 31,963 shapes covering 16 object categories. In their dataset, each object is usually decomposed into 2 to 5 coarse semantic parts. PartNet provides more fine-grained part annotations, with 18 parts per shape on average.

Many recent works studied fine-grained and hierarchical shape segmentation. Yi et al. [42] leveraged the noisy part decompositions present in CAD model designs and trained per-category models to learn consistent shape hierarchies. Chang et al. [4] collected 27,477 part instances from 2,278 models covering 90 object categories and studied part properties related to language. Wang et al. [35] proposed a multi-component labeling benchmark containing 1,016 3D models from 10 ShapeNet [3] object categories with manually annotated fine-grained part semantics, and trained neural networks to group and label fine-grained part components. PartNet provides a large-scale dataset with 573,585 fine-grained and hierarchical shape part annotations covering 26,671 models from 24 object categories.

There are also many previous works that attempted to understand parts by their functionality and articulation. Hu et al. [11] constructed a dataset of 608 objects from 15 object categories annotated with object functionality and introduced a co-analysis method to learn category-wise object functionality. Hu et al. [10] proposed a dataset of 368 mobility units with diverse types of articulation and learned to predict part mobility information from a single static segmented 3D mesh. In PartNet, we assign consistent semantic labels that entail such functionality and articulation information to part components within each object category, which potentially enables PartNet to support such research.

3 Data Annotation

The data annotation is performed in a hierarchical manner. Expert-defined hierarchical part templates are provided to guarantee labeling consistency among multiple annotators. We design a single-thread question-answering 3D GUI to guide the annotation. We hire 66 professional annotators and train them for the annotation. The average annotation time per shape is 8 minutes, and at least one pass of verification is performed for each annotation to ensure accuracy.

3.1 Expert-Defined Part Hierarchy

Figure 3: We show the expert-defined hierarchical template for lamp (middle) and its instantiations for a table lamp (left) and a ceiling lamp (right). And-nodes are drawn in solid lines and Or-nodes in dashed lines. The template is deep and comprehensive enough to cover structurally different types of lamps. At the same time, the same part concepts, such as light bulb and lamp shade, are shared across the different types.

Shape segmentation naturally involves hierarchical understanding. People understand shapes at different segmentation granularities. Coarse parts convey global semantics while fine-grained parts provide more detailed understanding. Moreover, fine-grained part concepts are more obscure to define than coarse parts. Different annotators have different knowledge and backgrounds, so they may name parts differently when using free-form annotation [4]. To address these issues, we introduce And-Or-Graph-style hierarchical templates and collect part annotations according to the pre-defined templates.

Due to the lack of well-acknowledged rules of thumb for defining good templates, designing hierarchical part templates for a category is non-trivial. Furthermore, the requirement that the designed template cover all variations of shapes and parts makes the problem more challenging. Below we summarize the criteria that guide our template design:

  • Well-defined: Part concepts are well-delineated such that parts are identifiable by multiple annotators;

  • Consistent: Part concepts are shared and reused across different parts, shapes and object categories;

  • Compact: There is no unnecessary part concept and part concepts are reused whenever possible;

  • Hierarchical: Part concepts are organized in a taxonomy to cover both coarse and fine-grained parts;

  • Atomic: Leaf nodes in the part taxonomy consist of primitive, non-decomposable shapes;

  • Complete: The part taxonomy covers a heterogeneous variety of shapes as completely as possible.

Guided by these general principles, we build an And-Or-Graph-style part template for each object category. The templates are defined by experts after examining a broad variety of objects in the category. Each template is designed in a hierarchical manner, from coarse semantic parts down to fine-grained primitive-level components. Figure 3 (middle) shows the lamp template. And-nodes segment a part into smaller subcomponents. Or-nodes indicate subcategorization of the current part. The combination of And-nodes and Or-nodes allows us to cover structurally different shapes using the same template while sharing as many common part labels as possible; a minimal code sketch of this template structure follows below. As in Figure 3 (left) and (right), both table lamps and ceiling lamps are explained by the same template through the first-level Or-node for lamp types.
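To make the template structure concrete, here is a minimal sketch of how an And-Or-Graph-style template could be represented and queried. The class name, field names, and the tiny lamp hierarchy are our own illustrative choices, not the released PartNet template format.

```python
# Minimal sketch of an And-Or-Graph-style part template (hypothetical
# field names; the released PartNet template format may differ).

class TemplateNode:
    def __init__(self, label, node_type, children=None):
        assert node_type in ("and", "or", "leaf")
        self.label = label          # part concept, e.g. "lamp_shade"
        self.node_type = node_type  # "and": decompose, "or": subcategorize
        self.children = children or []

# A tiny lamp template: an Or-node picks the lamp type, And-nodes
# decompose each type into its parts.
lamp_template = TemplateNode("lamp", "or", [
    TemplateNode("table_lamp", "and", [
        TemplateNode("lamp_base", "leaf"),
        TemplateNode("lamp_body", "leaf"),
        TemplateNode("lamp_shade", "leaf"),
        TemplateNode("light_bulb", "leaf"),
    ]),
    TemplateNode("ceiling_lamp", "and", [
        TemplateNode("lamp_shade", "leaf"),   # part concept shared across types
        TemplateNode("light_bulb", "leaf"),
        TemplateNode("chain", "leaf"),
    ]),
])

def leaf_labels(node):
    """Collect all leaf part concepts reachable from a template node."""
    if node.node_type == "leaf":
        return [node.label]
    labels = []
    for child in node.children:
        labels.extend(leaf_labels(child))
    return labels

print(sorted(set(leaf_labels(lamp_template))))
```

Because shared concepts such as lamp_shade appear under several Or-branches, structurally different shapes instantiated from the same template still reuse the same part labels.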

Despite the depth and comprehensiveness of these templates, it is still impossible to cover all cases. Thus, we allow our annotators to improve upon the structure of the template and to annotate parts that are out of the scope of our definition. We also conduct template refinements to resolve part ambiguity after we obtain the data annotation according to the original templates. To systematically identify ambiguities, we reserve a subset of shapes from each class and collect multiple human annotations for each shape. We compute the confusion matrix among different annotators and address data inconsistencies. For example, we merge two concepts with high confusion scores or remove a part if it is frequently segmented in the wrong way. We provide more details about this in the supplementary material.

3.2 Annotation Interface

Figure 4: We show our annotation interface with its components, the proposed question-answering workflow and the mesh cutting interface.

Figure 4 (a) shows our web-based annotation interface. Based on the template hierarchy, the annotation process is designed to be a single-thread question-answering workflow, traversing the template graph in a depth-first manner, as shown in Figure 4 (b). Starting from the root node, the annotator is asked a sequence of questions. The answers automatically construct the final hierarchical segmentation for the current shape instance. For each question, the annotator is asked to mark the number of subparts (And-node) or pick one among all subtypes (Or-node) for a given part. For each leaf node part, the annotator annotates the part geometry in the 3D interface. To help them understand the part definition and specification, we provide rich textual definitions and visual examples for each part. In addition, our interface supports cross-section and visibility control to annotate the interior structure of a 3D model.

The collected 3D CAD models often include original mesh subgroups and part information. Some of the grouping information is detailed enough to determine the final segmentation we need. Considering this, we provide the annotators with the original groupings at the beginning of the annotation, to speed up annotation. The annotators can simply select multiple predefined pieces to form a part of the final segmentation. We also provide mesh cutting tools to split large pieces into smaller ones following [5], when the original groupings are coarser than the desired segmentation, as shown in Figure 4 (c). The annotators draw boundary lines on the remeshed watertight surface [13] and the mesh cutting algorithm automatically splits the mesh into multiple smaller subcomponents.

In contrast to prior work, our UI is designed for operating directly on 3D models and collecting fine-grained and hierarchical part instances. Compared to Yi et al. [43], where the annotation is performed in 2D, our approach allows the annotators to annotate directly on the 3D shapes and thus pick up subtle part details that are hidden from 2D renderings. Chang et al. [4] propose a 3D UI that paints regions on mesh surfaces for part labeling. However, their interface is limited to existing over-segmentations of part components and does not support hierarchical annotations.

4 PartNet Dataset

The final PartNet dataset provides fine-grained, hierarchical, instance-level part segmentation annotations for 26,671 shapes with 573,585 part instances from 24 object categories. Most of the shapes and object categories are from ShapeNetCore [3]. We add object categories that are commonly present in indoor scenes (i.e. scissors, refrigerators, and doors) and augment the existing categories with more 3D models from 3D Warehouse (https://3dwarehouse.sketchup.com).

Figure 2 and Table 2 show the PartNet data and statistics. More visualization and statistics are included in the supplementary material. Our templates define hierarchical segmentations with a median depth of 3 and a maximum depth of 7. In total, we annotate 573,585 fine-grained part instances, with a median of 14 parts per shape and a maximum of 230. To study annotation consistency, we also collect a subset of shapes and ask for multiple annotations per shape.

5 Tasks and Benchmarks

We benchmark three part-level object understanding tasks using PartNet: fine-grained semantic segmentation, hierarchical semantic segmentation and instance segmentation. Four state-of-the-art algorithms for semantic segmentation are evaluated and three baseline methods are proposed for hierarchical segmentation. Moreover, we propose a novel method for instance segmentation that outperforms the existing baseline method.

Data Preparation.

In this section, we only consider parts that can be fully determined by their shape geometry. (Although 3D models in ShapeNet [3] come with face normals, textures, materials and other information, there is no guarantee of the quality of such information; we leave using it as future work.) We exclude from the evaluation parts that require additional information to identify, such as the glass parts on cabinet doors, which require opacity to identify, and the buttons on microwaves, which require texture to distinguish from the main frame. We also remove rarely appearing parts from the evaluation, as there are too few samples to train and evaluate networks.

We sample 10,000 points from each CAD model with furthest point sampling and use the 3D coordinates as the neural network inputs for all the experiments in the paper; a sampling sketch is shown below. The proposed dataset is split into train, validation and test sets with the ratio 70%: 10%: 20%. The shapes with multiple human annotations are not used in the experiments.
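For reference, the sketch below shows one straightforward greedy implementation of furthest point sampling over a dense surface point set. The 10,000-point budget follows the paper; the function and variable names, and the random stand-in for surface samples, are our own.

```python
import numpy as np

def furthest_point_sampling(points, num_samples=10000, seed=0):
    """Greedy furthest point sampling over an (N, 3) array of xyz coordinates."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = np.empty(num_samples, dtype=np.int64)
    selected[0] = rng.integers(n)
    # Distance from every point to the nearest already-selected point.
    dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for i in range(1, num_samples):
        selected[i] = np.argmax(dist)          # pick the point furthest from the set
        new_dist = np.linalg.norm(points - points[selected[i]], axis=1)
        dist = np.minimum(dist, new_dist)
    return points[selected]

# Example: densely sample a mesh surface first, then subsample 10,000 points.
dense_points = np.random.rand(20000, 3)   # stand-in for surface samples of a CAD model
inputs = furthest_point_sampling(dense_points, num_samples=10000)
print(inputs.shape)  # (10000, 3)
```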

5.1 Fine-grained Semantic Segmentation

 

Avg Bag Bed Bott Bowl Chair Clock Dish Disp Door Ear Fauc Hat Key Knife Lamp Lap Micro Mug Frid Scis Stora Table Trash Vase

 

P1 57.9 42.5 32.0 33.8 58.0 64.6 33.2 76.0 86.8 64.4 53.2 58.6 55.9 65.6 62.2 29.7 96.5 49.4 80.0 49.6 86.4 51.9 50.5 55.2 54.7
P2 37.3 20.1 38.2 55.6 38.3 27.0 41.7 35.5 44.6 34.3
P3 35.6 13.4 29.5 27.8 28.4 48.9 76.5 30.4 33.4 47.6 32.9 18.9 37.2 33.5 38.0 29.0 34.8 44.4
Avg 51.2 42.5 21.8 31.7 58.0 43.5 30.8 60.2 81.7 44.4 43.3 53.1 55.9 65.6 47.6 25.2 96.5 42.8 80.0 39.5 86.4 44.8 37.9 45.0 49.6

 

P+1 65.5 59.7 51.8 53.2 67.3 68.0 48.0 80.6 89.7 59.3 68.5 64.7 62.4 62.2 64.9 39.0 96.6 55.7 83.9 51.8 87.4 58.0 69.5 64.3 64.4
P+2 44.5 38.8 43.6 55.3 49.3 32.6 48.2 41.9 49.6 41.1
P+3 42.5 30.3 41.4 39.2 41.6 50.1 80.7 32.6 38.4 52.4 34.1 25.3 48.5 36.4 40.5 33.9 46.7 49.8
Avg 58.1 59.7 40.3 47.3 67.3 50.3 44.8 62.0 85.2 47.1 53.5 58.6 62.4 62.2 49.5 32.3 96.6 50.8 83.9 43.4 87.4 49.4 48.2 55.5 57.1

 

S1 60.4 57.2 55.5 54.5 70.6 67.4 33.3 70.4 90.6 52.6 46.2 59.8 63.9 64.9 37.6 30.2 97.0 49.2 83.6 50.4 75.6 61.9 50.0 62.9 63.8
S2 41.7 40.8 39.6 59.0 48.1 24.9 47.6 34.8 46.0 34.5
S3 37.0 36.2 32.2 30.0 24.8 50.0 80.1 30.5 37.2 44.1 22.2 19.6 43.9 39.1 44.6 20.1 42.4 32.4
Avg 53.6 57.2 44.2 43.4 70.6 45.7 29.1 59.8 85.4 43.7 41.7 52.0 63.9 64.9 29.9 24.9 97.0 46.9 83.6 41.4 75.6 50.8 34.9 52.7 48.1

 

C1 64.3 66.5 55.8 49.7 61.7 69.6 42.7 82.4 92.2 63.3 64.1 68.7 72.3 70.6 62.6 21.3 97.0 58.7 86.5 55.2 92.4 61.4 17.3 66.8 63.4
C2 46.5 42.6 47.4 65.1 49.4 22.9 62.2 42.6 57.2 29.1
C3 46.4 41.9 41.8 43.9 36.3 58.7 82.5 37.8 48.9 60.5 34.1 20.1 58.2 42.9 49.4 21.3 53.1 58.9
Avg 59.8 66.5 46.8 45.8 61.7 53.6 39.5 68.7 87.4 50.2 56.5 64.6 72.3 70.6 48.4 21.4 97.0 59.7 86.5 46.9 92.4 56.0 22.6 60.0 61.2

 

Table 3: Fine-grained semantic segmentation results (part-category mIoU %). Algorithm P, P+, S and C refer to PointNet [29], PointNet++ [30], SpiderCNN [40] and PointCNN [24], respectively. The number 1, 2 and 3 refer to the three levels of segmentation: coarse-, middle- and fine-grained. We put short lines for the levels that are not defined.

Recent advances in 3D semantic segmentation [29, 30, 44, 17, 33, 22, 7, 37, 38, 40, 31, 6, 24, 21] have achieved promising results on coarse-level segmentation on the ShapeNet Part dataset [3, 43]. However, few works focus on fine-grained 3D semantic segmentation, due to the lack of a large-scale fine-grained dataset. With the proposed PartNet dataset, researchers can now work on this more challenging task with little overhead.

Fine-grained 3D semantic segmentation requires recognizing and distinguishing small and similar semantic parts. For example, door handles are usually small but semantically important parts of doors. Beds have several geometrically similar parts such as side vertical bars, post bars and base legs. To recognize such subtle part details, segmentation systems need to understand them both locally, through discriminative features, and globally, in the context of the whole shape.

Benchmark Algorithms.

We benchmark four state-of-the-art semantic segmentation algorithms on fine-grained PartNet segmentation: PointNet [29], PointNet++ [30], SpiderCNN [40] and PointCNN [24]. (There are many other candidate algorithms [44, 17, 33, 22, 7, 37, 38, 31, 6, 21]; we will host an online leaderboard to report their performance.) PointNet [29] takes unordered point sets as inputs and extracts features for shape classification and segmentation. To better learn local geometric features, the follow-up work PointNet++ [30] proposes a hierarchical feature extraction scheme. SpiderCNN [40] extends traditional convolution operations on 2D images to 3D point clouds by parameterizing a family of convolutional filters. To organize the unordered points into a latent canonical order, PointCNN [24] proposes to learn an X-transformation and applies X-convolution operations on the canonical points.

We train the four methods on the dataset, using the default network architectures and hyperparameters described in their papers. Instead of training a single network for all object categories as done in most of these papers, we train a network for each category at each segmentation level. We input only the 3D coordinates for fair comparison (PointNet++ [30] and SpiderCNN [40] by default use point normals as additional inputs, which we omit) and train the networks until convergence. More training details are described in the supplementary material.

Figure 5: Qualitative results for semantic segmentation. The top row shows the ground-truth and the bottom row shows the PointCNN prediction. The black points indicate unlabeled points.

Evaluation and Results.

We evaluate the algorithms at three segmentation levels for each object category: coarse-, middle- and fine-grained. The coarse level approximately corresponds to the granularity in Yi et al[43]. The fine-grained level refers to the segmentation down to leaf levels in the segmentation hierarchies. For structurally deep hierarchies, we define the middle level in between. Among 24 object categories, all of them have the coarse levels, while 9 have the middle levels and 17 have the fine levels. Overall, we define 50 segmentation levels for 24 object categories.

In Table 3, we report the semantic segmentation performance at multiple levels of granularity on PartNet. We use the mean Intersection-over-Union (mIoU) score as the evaluation metric. After removing unlabeled ground-truth points, for each object category, we first calculate the IoU between the predicted point set and the ground-truth point set for each semantic part category across all test shapes. Then, we average the per-part-category IoUs to compute the mIoU for the object category; a sketch of this computation is shown below. We further calculate the average mIoU across different levels for each object category and finally report the average across all object categories.
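A minimal sketch of the part-category mIoU computation described above, with our own helper and variable names; we assume unlabeled ground-truth points carry label 0 and part ids run from 1 to num_parts.

```python
import numpy as np

def part_category_miou(pred_labels, gt_labels, num_parts):
    """Part-category mIoU for one object category at one segmentation level.

    pred_labels, gt_labels: lists of (N,) integer arrays, one per test shape.
    Label 0 marks unlabeled ground-truth points; part ids are 1..num_parts.
    """
    inter = np.zeros(num_parts)
    union = np.zeros(num_parts)
    for pred, gt in zip(pred_labels, gt_labels):
        mask = gt > 0                      # drop unlabeled ground-truth points
        pred, gt = pred[mask], gt[mask]
        for part in range(1, num_parts + 1):
            p, g = pred == part, gt == part
            inter[part - 1] += np.logical_and(p, g).sum()
            union[part - 1] += np.logical_or(p, g).sum()
    # IoU per part category accumulated over all test shapes, then averaged;
    # parts that never appear and are never predicted are skipped.
    ious = inter[union > 0] / union[union > 0]
    return float(ious.mean())
```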

Unsurprisingly, performance for all four algorithms drops by a large margin from the coarse level to the fine-grained level. Figure 5 shows qualitative results from PointCNN. The method does not perform well on small parts, such as the door handle in the door example, or on visually similar parts, such as the stair steps and the horizontal bars on the bed frame. How to learn discriminative features that better capture both local geometry and global context for these cases is an interesting topic for future work.

5.2 Hierarchical Semantic Segmentation

 

Avg Bed Bott Chair Clock Dish Disp Door Ear Fauc Knife Lamp Micro Frid Stora Table Trash Vase

 

  Bottom-Up 51.2 40.8 56.1 47.2 38.3 61.5 84.1 52.6 54.3 63.4 52.3 36.8 48.2 41.0 46.8 38.3 53.6 54.4
  Top-Down 50.8 41.1 56.2 46.5 34.3 54.5 84.7 50.6 59.5 61.4 55.6 37.1 48.8 41.6 45.2 37.0 53.5 55.6
  Ensemble 51.7 42.0 54.7 48.1 44.5 58.8 84.7 51.4 57.2 61.9 51.9 37.6 47.5 41.4 47.3 44.0 52.8 53.1

 

Table 4: Hierarchical segmentation results (part-category mIoU %). We present the hierarchical segmentation performance of three baseline methods: bottom-up, top-down and ensemble. We conduct experiments on the 17 out of 24 categories with tree depth greater than 1.

Shape segmentation is hierarchical by nature. From coarse semantics to fine-grained details, hierarchical understanding of 3D objects supports holistic and comprehensive reasoning about shape components. For this purpose, we study the hierarchical semantic segmentation problem, which predicts semantic part labels over the entire shape hierarchy, covering both coarse and fine-grained part concepts.

A key problem in hierarchical segmentation is how to leverage the rich part relationships in the given shape templates during learning. Recognizing a chair base as a finer-level swivel base significantly reduces the solution space for detecting more fine-grained parts such as central supporting bars, star-base legs and wheels. On the other hand, the lack of a chair back increases the possibility that the object is a stool. Different from Sec. 5.1, where we consider the problem at each segmentation level separately, hierarchical segmentation requires a holistic understanding of the entire part hierarchy.

Benchmark Algorithms.

We propose three baseline methods to tackle hierarchical segmentation: bottom-up, top-down and ensemble. The bottom-up method considers only the leaf-node parts during training and, at inference time, aggregates the predictions of children nodes into their parent nodes as defined in the hierarchy (see the sketch below). The top-down method learns a multi-labeling task over all part semantic labels on the tree and conducts top-down inference by classifying coarser-level nodes first and then finer-level ones. For the ensemble method, we train flat segmentation at multiple levels as defined in Sec. 5.1 and conduct joint inference by calculating the average log-likelihood scores over all root-to-leaf paths on the tree. We use PointNet++ [30] as the backbone network in this work; other methods listed in Sec. 5.1 can also be used. More architecture and training details are described in the supplementary material.
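As an illustration of the bottom-up baseline's inference step, the sketch below aggregates per-point leaf-node probabilities into parent-node probabilities by summing over each parent's leaf descendants. The dictionary describing the hierarchy and the part names are hypothetical, not the released PartNet format.

```python
import numpy as np

# Hypothetical hierarchy description: each parent node maps to the leaf part
# ids beneath it (the leaf probabilities come from the leaf-level network).
parent_to_leaves = {
    "chair_back": [1, 2, 3],          # e.g. back_frame, back_surface, back_vertical_bar
    "chair_seat": [4, 5],
    "chair_base": [6, 7, 8, 9],
}

def bottom_up_inference(leaf_probs, parent_to_leaves):
    """leaf_probs: (N, num_leaf_parts) softmax scores from the leaf-level network.

    Returns per-point scores for every parent node, obtained by summing the
    probabilities of its leaf descendants (leaf parts are mutually exclusive,
    so the sums remain valid probabilities).
    """
    parent_probs = {}
    for parent, leaves in parent_to_leaves.items():
        idx = np.array(leaves) - 1           # leaf part ids are 1-indexed here
        parent_probs[parent] = leaf_probs[:, idx].sum(axis=1)
    return parent_probs

leaf_probs = np.random.dirichlet(np.ones(9), size=2048)   # fake predictions for 2048 points
parent_probs = bottom_up_inference(leaf_probs, parent_to_leaves)
print({k: v.shape for k, v in parent_probs.items()})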

Evaluation and Results.

Table 4 shows the performance of the three baseline methods. We calculate the mIoU for each part category and compute the average over all tree nodes as the evaluation metric. The experimental results show that the three methods perform similarly, with small performance gaps. The ensemble method performs slightly better than the other two, especially for categories with rich structural and sub-categorization variation, such as chair, table and clock.

The bottom-up method only considers leaf-node parts during training. Although the template structure is not used directly, the parent-node semantics of leaf nodes are implicitly encoded in the leaf-node part definitions. For example, the vertical bars of chair backs and chair arms are two different leaf nodes. The top-down method explicitly leverages the tree structure in both the training and testing phases. However, prediction errors accumulate through top-down inference. The ensemble method decouples the hierarchical segmentation task into individual tasks at multiple levels and performs joint inference, taking the predictions at all levels into consideration. Though it demonstrates better performance, it has more hyper-parameters and requires longer training time for the multiple networks.

5.3 Instance Segmentation

 

Avg Bag Bed Bott Bowl Chair Clock Dish Disp Door Ear Fauc Hat Key Knife Lamp Lap Micro Mug Frid Scis Stora Table Trash Vase

 

S1 55.7 38.8 29.8 61.9 56.9 72.4 20.3 72.2 89.3 49.0 57.8 63.2 68.7 20.0 63.2 32.7 100 50.6 82.2 50.6 71.7 32.9 49.2 56.8 46.6
S2 29.7 15.4 25.4 58.1 25.4 21.7 49.4 22.1 30.5 18.9
S3 29.5 11.8 45.1 19.4 18.2 38.3 78.8 15.4 35.9 37.8 38.3 14.4 32.7 18.2 21.5 14.6 24.9 36.5
Avg 46.8 38.8 19.0 53.5 56.9 39.1 19.3 56.2 84.0 29.9 46.9 50.5 68.7 20.0 50.7 22.9 100 44.2 82.2 30.3 71.7 28.3 27.5 40.9 41.6

 

O1 62.6 64.7 48.4 63.6 59.7 74.4 42.8 76.3 93.3 52.9 57.7 69.6 70.9 43.9 58.4 37.2 100 50.0 86.0 50.0 80.9 45.2 54.2 71.7 49.8
O2 37.4 23.0 35.5 62.8 39.7 26.9 47.8 35.2 35.0 31.0
O3 36.6 15.0 48.6 29.0 32.3 53.3 80.1 17.2 39.4 44.7 45.8 18.7 34.8 26.5 27.5 23.9 33.7 52.0
Avg 54.4 64.7 28.8 56.1 59.7 46.3 37.5 64.1 86.7 36.6 48.5 57.1 70.9 43.9 52.1 27.6 100 44.2 86.0 37.2 80.9 35.9 36.4 52.7 50.9

 

Table 5: Instance segmentation results (part-category mAP %, IoU threshold 0.5). Algorithm S and O refer to SGPN [34] and our proposed method respectively. The number 1, 2 and 3 refer to the three levels of segmentation: coarse-, middle- and fine-grained. We put short lines for the levels that are not defined.

The goal of instance segmentation is to detect every individual part instance and segment it out from the context of the shape. Many applications in computer graphics, vision and robotics, including manufacturing, assembly, interaction and manipulation, require instance-level part recognition. Compared to detecting objects in scenes, parts of objects usually have stronger and more intertwined structural relationships. The existence of many visually similar but semantically different parts makes the part detection problem challenging. To the best of our knowledge, this work is the first to provide a large-scale shape part instance-level segmentation benchmark.

Given a shape point cloud as input, the task of part instance segmentation is to produce several disjoint masks over the entire point cloud, each of which corresponds to an individual part instance of the object. We adopt the part semantics from the segmentation levels defined in Sec. 5.1. The detected masks should have no overlaps, but together they do not necessarily cover the entire point cloud, as some points may not belong to any part of interest.

Figure 6: The proposed detection-by-segmentation method for instance segmentation. The network learns to predict three components: the semantic label for each point, a set of disjoint instance masks and their confidence scores for part instances.

Benchmark Algorithms.

We propose a novel detection-by-segmentation network to address instance segmentation. We illustrate our network architecture in Figure 6. We use PointNet++ [30] as the backbone network for extracting features and predicting both a semantic label for each point and a set of instance segmentation masks over the input point cloud of 10,000 points. Moreover, we train a separate mask for the points without semantic labels in the ground truth. A softmax activation layer is applied to encourage mutual exclusiveness among the different masks, so that the mask values at each point sum to one. To train the network, we apply the Hungarian algorithm [20] to find a bipartite matching between the predicted masks and the ground-truth masks, and regress each predicted mask to its matched ground-truth mask. We employ a relaxed (soft) version of IoU [19], computed directly on the predicted mask probabilities, as the matching metric for the Hungarian algorithm; a matching sketch is given below. Meanwhile, a separate branch is trained to predict a confidence score for each predicted mask.
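To make the matching step concrete, here is a small sketch of assigning predicted soft masks to ground-truth instance masks with the Hungarian algorithm. The soft-IoU formula shown is a common relaxation and the function names are ours; this is not the paper's exact implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def relaxed_iou(pred, gt, eps=1e-8):
    """Soft IoU between a predicted mask (probabilities in [0, 1]) and a binary
    ground-truth mask, both of shape (N,): <p, g> / (|p| + |g| - <p, g>)."""
    inter = (pred * gt).sum()
    return inter / (pred.sum() + gt.sum() - inter + eps)

def match_masks(pred_masks, gt_masks):
    """pred_masks: (K, N) soft masks, gt_masks: (M, N) binary masks.
    Returns (pred_idx, gt_idx) pairs maximizing the total relaxed IoU."""
    k, m = pred_masks.shape[0], gt_masks.shape[0]
    scores = np.zeros((k, m))
    for i in range(k):
        for j in range(m):
            scores[i, j] = relaxed_iou(pred_masks[i], gt_masks[j])
    # The Hungarian algorithm minimizes cost, so negate the IoU scores.
    pred_idx, gt_idx = linear_sum_assignment(-scores)
    return pred_idx, gt_idx

# Toy example: 4 predicted masks over 1,000 points, 3 ground-truth instances.
rng = np.random.default_rng(0)
pred_masks = rng.dirichlet(np.ones(4), size=1000).T   # columns sum to 1 per point
gt_masks = np.eye(4)[rng.integers(0, 3, size=1000)].T[:3]
print(match_masks(pred_masks, gt_masks))
```

Each matched pair then contributes a relaxed-IoU regression target for the mask branch, while unmatched predicted masks are pushed toward zero by the regularization term described next.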

The loss function combines five terms: a cross-entropy semantic segmentation loss, an IoU loss for mask regression, an IoU loss for the unlabeled points, a prediction-confidence loss, and a regularization term that encourages unused prediction masks to vanish [32]. We use the same fixed weights for these terms in all experiments.

We compare the proposed method with SGPN [34], which learns similarity scores among all pairs of points and detects part instances by grouping points that share similar features. We follow most of the default settings and hyper-parameters described in their paper. We first pre-train the PointNet++ semantic segmentation branch and then fine-tune it to improve the per-point feature similarity matrix and confidence maps. We use margin values of 1 and 2 for the double-hinge loss as suggested by the authors of [34], instead of the 10 and 80 in the original paper. We feed 10,000 points to the network at a time, and use a batch size of 32 in pre-training and 1 in fine-tuning.

Figure 7: Qualitative results for instance segmentation. Our method produces more robust and cleaner results than SGPN.
Figure 8: Learned instance correspondences. The corresponding parts are marked with the same color.

Evaluation and Results.

Table 5 reports the per-category mean Average Precision (mAP) scores for SGPN and our proposed method. For each object category, the mAP score calculates the AP for each semantic part category across all test shapes and averages the AP across all part categories; a sketch of the AP computation is shown below. Finally, we take the average of the mAP scores across the different levels of segmentation within each object category and then report the average over all object categories. We compute the IoU between each prediction mask and the closest ground-truth mask and regard a prediction mask as a true positive when its IoU is larger than 0.5.
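A simplified sketch of how the per-part-category AP at an IoU threshold of 0.5 can be computed from confidence-ranked predictions. The input format (per-prediction confidence, best IoU, and matched ground-truth id) and helper names are our own assumptions; duplicate matches to the same ground-truth instance are counted as false positives.

```python
import numpy as np

def average_precision(predictions, num_gt, iou_thresh=0.5):
    """predictions: list of (confidence, best_iou, gt_id) tuples over all test
    shapes for one part category, where best_iou/gt_id describe the closest
    ground-truth instance. num_gt: total ground-truth instances of this part."""
    predictions = sorted(predictions, key=lambda p: -p[0])   # high confidence first
    matched, tp, fp = set(), [], []
    for conf, iou, gt_id in predictions:
        if iou >= iou_thresh and gt_id not in matched:
            matched.add(gt_id)                 # first sufficient match: true positive
            tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)         # low IoU or duplicate: false positive
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = np.concatenate([[0.0], tp / max(num_gt, 1)])
    precision = np.concatenate([[1.0], tp / np.maximum(tp + fp, 1)])
    # Area under the precision-recall curve.
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))

preds = [(0.9, 0.8, 0), (0.7, 0.4, 1), (0.6, 0.9, 0)]   # (confidence, IoU, matched gt id)
print(average_precision(preds, num_gt=2))                # 0.5
```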

Figure 7 shows qualitative comparisons between our proposed method and SGPN. Our proposed method produces more robust and cleaner instance predictions. After learning point features, SGPN has a post-processing stage that merges points with similar features into one component. This process involves many thresholding hyper-parameters. Even though most parameters are automatically inferred from the validation data, SGPN still suffers from predicting partial or noisy instances when the thresholds are poorly set. Our proposed method learns structural priors within each object category and is more instance-aware and robust in predicting complete instances. We observe that training for a set of disjoint masks across multiple shapes gives us consistent part instances. We show the learned part correspondences in Figure 8.

6 Conclusion

We introduce PartNet: a large-scale benchmark for fine-grained, hierarchical, and instance-level 3D shape segmentation. It contains 573,585 part annotations for 26,671 ShapeNet [3] models covering 24 object categories. Based on the dataset, we propose three shape segmentation benchmarks: fine-grained semantic segmentation, hierarchical semantic segmentation and instance segmentation. We benchmark four state-of-the-art algorithms for semantic segmentation and propose a novel method for instance segmentation that outperforms the existing baseline method. Our dataset enables future research directions such as collecting more geometric and semantic annotations on parts, investigating shape grammars for synthesis, and animating object articulation in virtual environments for robot learning.

Acknowledgements

This research was supported by NSF grants CRI-1729205 and IIS-1763268, a Vannevar Bush Faculty Fellowship, a Google fellowship, and gifts from Autodesk, Google and Intel AI Lab. We especially thank Zhe Hu from Hikvision for help with data annotation and Linfeng Zhao for help with preparing the hierarchical templates. We also thank the 66 annotators from Hikvision, Ytuuu and Data++ for their data annotation work.

References

Appendix A Overview

This document supplements the main paper with additional dataset visualization and statistics (Sec. B), hierarchical template design details and visualization (Sec. C), and architecture and training details for the three shape segmentation tasks (Sec. D).

Appendix B More Dataset Visualization and Statistics

We present more visualization and statistics over the proposed PartNet dataset.

B.1 More Fine-grained Segmentation Visualization

Figures 13 and 14 show more visualizations of fine-grained instance-level segmentation annotations in PartNet. Note the complexity of the annotated segmentations and the heterogeneous variation of shapes within each object category.

B.2 More Hierarchical Segmentation Visualization

Figures 15, 16 and 17 show more visualizations of example hierarchical instance-level segmentation annotations in PartNet. We visualize the tree structure of the hierarchical segmentation annotation with the 2D part renderings associated with the tree nodes.

B.3 Shape Statistics

We report the statistics for the number of annotations, unique shapes, and shapes for which we collect multiple human annotations in Figure 9.

B.4 Part Statistics

We report the statistics for the number of part semantics for each object category in Figure 10. We also present the statistics for the maximum and median number of part instances per shape for each object category in Figure 11. We report the statistics for the maximum and median tree depth for each object category in Figure 12.

Appendix C More Template Design Details and Visualization

We provide more details and visualization for the expert-defined hierarchical templates to guide the hierarchical segmentation annotation and the template refinement procedure to resolve annotation inconsistencies.

Figure 9: PartNet shape statistics. We report the statistics for the number of annotations, unique shapes, and shapes for which we collect multiple human annotations.
Figure 10: PartNet part semantics statistics. We report the statistics for the number of part semantics for each object category.
Figure 11: PartNet part instance statistics. We report the statistics for the maximum and median number of part instances per shape for each object category.
Figure 12: PartNet tree depth statistics. We report the statistics for the maximum and median tree depth for each object category.

C.1 Template Design Details

We design templates according to the rules of thumb described in the main paper. We also consulted many online references (e.g. http://www.props.eric-hart.com/resources/parts-of-a-chair/) that describe object parts (often for manufacturing and assembly), as well as previous work relating language to shapes [4], as guides for the design of our templates. To ensure that our templates cover most of the shape variations and part semantics of each object category, we generated a t-SNE [26] visualization of the entire shape space to study the shape variation. We trained an auto-encoder on the shape geometry within each object category to obtain shape embeddings for the t-SNE visualization.

 

Avg Bag Bed Bott Bowl Chair Clock Dish Disp Door Ear Fauc Hat Key Knife Lamp Lap Micro Mug Frid Scis Stora Table Trash Vase

 

Oavg 69.8 54.8 70.0 87.5 87.7 59.0 62.1 67.3 85.2 64.4 74.1 69.9 86.8 77.0 75.0 44.6 61.6 71.0 91.0 65.3 88.7 68.0 40.1 51.4 72.6
Ostd 19.0 29.3 17.4 6.9 8.1 26.7 24.6 19.1 17.8 15.2 15.6 17.6 9.5 17.9 14.6 29.0 27.1 19.3 10.6 27.7 9.0 21.3 29.3 23.3 19.5

 

Ravg 83.3 82.1 76.2 89.3 91.7 77.8 91.1 81.5 94.0 77.0 83.0 84.7 89.3 89.6 77.8 72.7 78.3 84.4 91.7 85.1 90.2 77.1 71.4 71.0 92.3
Rstd 10.4 11.1 9.2 6.0 7.4 15.2 7.0 7.2 2.8 11.2 13.5 8.6 10.3 3.5 14.9 17.3 14.6 9.1 9.7 6.2 8.6 12.8 22.6 13.2 7.5

 

Table 6: The average confusion scores and standard deviations for multiple annotations (%). We report the average confusion scores and standard deviations, calculated over the diagonal entries of the confusion matrix for each object category, using the small subset of shapes for which we collect multiple human annotations. Rows O and R respectively refer to the scores before and after the template refinement process.

Although we try to cover the most common part semantics in our templates, it is difficult to cover all possible object parts. Thus, we allow annotators to deviate from the templates and define their own parts and segmentation structures. Among all annotated part instances, 1.3% are defined by the annotators. In the raw annotation, 13.1% of shapes contained user-defined part labels.

Our analysis shows that our template designs cover most of the ShapeNet [3] shapes. Of the 27,260 shapes we collected in total, our annotators successfully labeled 26,671, giving our templates a coverage rate of at least 97.8% for ShapeNet shapes. While template coverage is a potential issue, the remaining 2.2% were not annotated mainly due to other problems such as poor mesh quality, classification errors, and errors during mesh splitting.

We design hierarchical templates that cover both coarse-level part semantics and fine-grained part details down to the primitive level, e.g. chair back vertical bar and bed base surface panel. Most primitive-level parts are atomic in the sense that they are very unlikely to be further divided by end applications. If an application requires a different segmentation hierarchy or level of segmentation than the ones we provide in our templates, developers and researchers can build their own segmentation upon the atomic primitives provided in PartNet.

Moreover, we try our best to make shared part concepts among different shapes, and even different object categories, use the same part labels. For example, we use the part label leg for tables, chairs, lamp bases, etc., and the part label wheel for both chair swivel base wheels and refrigerator base wheels. Such part concept sharing provides rich part correspondences within a specific object category and across multiple object categories.

C.2 Template Refinement Details

Fine-grained shape segmentation is challenging to annotate due to the subtle conceptual gaps between similar part semantics. Even though we provide detailed textual and visual explanations of our pre-defined parts, we still observe some annotation inconsistencies across annotators. To quantitatively diagnose such issues, we reserve a small subset of shapes for which we collect multiple human annotations. Then, we compute the confusion scores among the predefined parts across the multiple annotations and conduct careful template refinement to reduce part ambiguity.

There are primarily three sources of such inconsistencies: boundary ambiguity, granularity ambiguity and part labeling ambiguity. Boundary ambiguity refers to the unclear boundary between two parts, which is also commonly seen in previous works [5, 43]. For example, the boundary between the bottle neck and the bottle body is not that clear for wine bottles. Granularity ambiguity means that different annotators have different understanding about the segmentation granularity of the defined parts. One example is that, for a curvy and continuous chair arm, one can regard it as a whole piece or imagine the separation of armrest and arm support. The most common type of ambiguity in our dataset is the part labeling ambiguity. The fine-grained part concepts, though intended to be different category-wise, may apply to the same part on a given object. For example, a connecting structure between the seat and the base of a chair can be considered as chair seat support or chair base connector.

We study mutual human agreement on the multiple-annotation subset. We consider the parts defined at the leaf level of the segmentation hierarchy and compute the confusion matrix across multiple human annotations (we treat the entire path of labels down to a leaf node as its label when computing the confusion matrix); a sketch of this computation is given below. The ideal confusion matrix would be close to diagonal, without any part-level ambiguity. In our analysis, we observe human disagreement for some of our initial part definitions. To address the ambiguity, we either merge two similar concepts with high confusion scores or remove hard-to-distinguish parts from the evaluation. For example, we find that our annotators often mix up the annotations for regular tables and desks due to the similarity of the two concepts. Thus, we merge the desk subtype into the regular table subtype. In other cases, some small parts, such as the buttons on displays, are very tricky to segment out from the main display frame. Since they may not be reliably segmented, we remove such unclear segmentations from the evaluation.
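As an illustration of this consistency analysis, the sketch below accumulates a per-point confusion matrix between two annotations of the same shape, comparing full root-to-leaf label paths as whole units. The function, input format, and example paths are our own assumptions about how such an analysis could be run.

```python
import numpy as np

def update_confusion(confusion, label_index, annotation_a, annotation_b):
    """annotation_a/b: lists of full label paths, one per point, e.g.
    ("chair", "chair_base", "swivel_base", "wheel"). Paths are compared as
    whole units so identical leaf names under different parents stay distinct."""
    for path_a, path_b in zip(annotation_a, annotation_b):
        i, j = label_index[path_a], label_index[path_b]
        confusion[i, j] += 1
    return confusion

paths = [("chair", "chair_seat"), ("chair", "chair_base", "leg")]
label_index = {p: k for k, p in enumerate(paths)}
confusion = np.zeros((len(paths), len(paths)))
ann_a = [paths[0]] * 900 + [paths[1]] * 100   # fake per-point labels from annotator A
ann_b = [paths[0]] * 850 + [paths[1]] * 150   # fake per-point labels from annotator B
confusion = update_confusion(confusion, label_index, ann_a, ann_b)
# Diagonal fraction per part = agreement score used to flag ambiguous concepts.
agreement = confusion.diagonal() / np.maximum(confusion.sum(axis=1), 1)
print(agreement)
```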

 

Avg Bag Bed Bott Bowl Chair Clock Dish Disp Door Ear Fauc Hat Key Knife Lamp Lap Micro Mug Frid Scis Stora Table Trash Vase

 

P1 71.8 59.3 39.6 81.0 78.5 81.8 67.1 78.9 88.2 71.1 68.0 67.5 58.5 65.6 66.5 46.5 96.5 75.0 84.2 79.6 86.5 55.9 85.6 66.7 76.3
P2 50.1 21.3 52.4 60.0 47.1 43.5 64.3 63.9 48.8 50.0
P3 48.2 13.0 55.3 44.8 37.8 55.2 79.0 38.8 47.5 55.5 40.0 34.7 54.5 53.2 47.4 42.5 46.4 74.0
Avg 63.4 59.3 24.6 68.2 78.5 59.7 52.4 64.7 83.6 52.3 57.8 61.5 58.5 65.6 53.2 41.6 96.5 64.6 84.2 65.6 86.5 50.7 59.4 56.5 75.2

 

PP1 76.8 72.7 54.7 85.8 78.5 84.5 74.1 81.9 90.7 73.5 77.8 73.6 64.2 62.5 75.0 65.5 96.6 80.3 90.9 72.1 87.5 61.2 86.7 71.5 81.4
PP2 54.7 34.8 54.9 60.6 57.0 56.8 63.0 58.4 52.9 53.6
PP3 53.4 25.1 61.0 49.6 46.1 52.5 81.0 48.0 56.1 60.4 49.1 46.0 54.3 50.7 50.6 47.0 54.7 75.1
Avg 68.1 72.7 38.2 73.4 78.5 63.0 60.1 65.0 85.8 59.5 67.0 67.0 64.2 62.5 62.0 56.1 96.6 65.9 90.9 60.4 87.5 54.9 62.4 63.1 78.2

 

S1 73.9 72.9 55.9 86.1 83.4 83.8 72.1 73.3 90.4 60.4 70.6 71.5 71.6 64.6 42.1 59.1 97.1 78.6 91.6 68.7 77.0 64.2 83.8 74.4 79.5
S2 53.3 37.8 53.6 65.3 55.0 41.4 62.1 62.6 49.8 51.7
S3 48.0 27.2 52.8 44.7 44.2 51.1 77.2 40.7 47.5 53.7 27.3 35.7 54.4 52.4 53.1 43.3 48.0 62.3
Avg 65.1 72.9 40.3 69.4 83.4 60.7 58.1 63.2 83.8 52.0 59.0 62.6 71.6 64.6 34.7 45.4 97.1 65.0 91.6 61.2 77.0 55.7 59.6 61.2 70.9

 

C1 75.5 72.0 55.3 83.6 75.0 83.9 65.6 81.8 91.9 68.1 74.5 71.1 66.8 70.4 68.1 55.6 97.1 83.1 92.7 78.9 92.6 58.8 85.5 67.7 71.8
C2 52.1 36.6 52.9 63.4 54.9 42.4 64.1 57.7 54.4 42.7
C3 49.6 29.1 58.7 47.7 36.2 55.3 81.5 40.4 55.8 60.7 26.4 34.4 58.7 50.8 52.3 37.4 50.8 67.0
Avg 66.3 72.0 40.3 71.2 75.0 61.5 50.9 66.8 86.7 54.5 65.2 65.9 66.8 70.4 47.2 44.1 97.1 68.6 92.7 62.5 92.6 55.2 55.2 59.2 69.4

 

Table 7: Fine-grained semantic segmentation results (shape mIoU %). Algorithm P, P+, S and C refer to PointNet [29], PointNet++ [30], SpiderCNN [40] and PointCNN [24], respectively. The number 1, 2 and 3 refer to the three levels of segmentation: coarse-, middle- and fine-grained. We put short lines for the levels that are not defined.

Table 6 compares the annotation consistency before and after the template refinement process. We compute the confusion matrices at the most fine-grained segmentation level. After the template refinement, the data consistency score is 83.3% on average, a 13.5% improvement over the raw annotation. The template refinement process thus improves annotation consistency by a clear margin. This also reflects the complexity of annotating fine-grained part concepts. Future work may investigate how to design better templates with less part ambiguity.

C.3 More Visualization of Hierarchical Templates

Figures 18, 19 and 20 show more visualizations of the expert-designed hierarchical templates after resolving the data inconsistencies and conducting template refinements. The lamp template is shown in the main paper.

Appendix D Tasks and Benchmarks

In this section, we provide more details about the architectures and training details for the benchmark algorithms. We also present additional evaluation metrics, shape mean Intersection-over-Union (shape mIoU) and shape mean Average-Precision (Shape mAP), and report the quantitative results using these metrics.

D.1 Fine-grained Semantic Segmentation

More Architecture and Training Details

We follow the default architectures and training hyper-parameters used in the original papers: PointNet [29], PointNet++ [30], SpiderCNN [40] and PointCNN [24], except the following few modifications:

  • Instead of training one network for all object categories as done in the four prior works, we train separate networks for each object category at each segmentation level. This is mainly to handle the increase in the number of parts for fine-grained part segmentation. Originally, there are only 50 parts for all 16 object categories using the coarse ShapeNet Part dataset [43]. Now, using PartNet, there could be 480 different part semantics in total. Also, due to the data imbalance among different object categories, training a single network may overfit to the big categories.

  • We change the input point cloud size to 10,000. The original papers usually sample 1,000, 2,000 or 4,000 points as inputs to the networks. For PartNet, we suggest using at least 10,000 points to guarantee enough point samples on small fine-grained parts, e.g. a door handle or a small button.

  • We reduce the batch sizes for training the networks where necessary. Since we use a point cloud size of 10,000, we adjust the training batch size to fit training into the 12 GB memory of an NVIDIA TITAN Xp GPU. For PointNet [29], PointNet++ [30], SpiderCNN [40] and PointCNN [24], we use batch sizes of 24, 24, 2 and 4, respectively.

  • We only input 3D coordinates to all the networks for fair comparison. Although the 3D CAD models in ShapeNet [3] usually provide additional features, e.g. opacity, point normals, textures and material information, there is no guarantee of the quality of such information, so we choose not to use them as inputs. Also, using only pure geometry potentially increases the networks' generalizability to unseen objects or real scans [29]. PointNet++ [30] and SpiderCNN [40] by default take advantage of point normals as additional inputs; in this paper, we remove such inputs to the networks. However, point normals can be estimated from the point clouds; we leave this as future work.

 

Avg Bed Bott Chair Clock Dish Disp Door Ear Fauc Knife Lamp Micro Frid Stora Table Trash Vase

 

  Bottom-Up 65.9 42.0 74.3 63.8 64.1 66.3 84.2 61.4 70.0 74.2 67.1 62.7 63.0 60.8 57.8 65.7 62.8 80.9
  Top-Down 65.9 42.0 73.7 62.3 65.5 64.0 85.5 63.1 71.1 73.5 68.8 63.3 62.7 58.8 57.6 66.2 63.0 79.3
  Ensemble 66.6 42.9 74.4 64.3 65.5 62.7 85.8 63.7 71.7 74.0 66.7 63.4 61.9 61.5 60.6 67.5 64.0 82.2

 

Table 8: Hierarchical segmentation results (shape mIoU %). We present the hierarchical segmentation performance of three baseline methods: bottom-up, top-down and ensemble. We conduct experiments on the 17 out of 24 categories with tree depth greater than 1.

Shape mIoU Metric and Results

We introduce the shape mean Intersection-over-Union (Shape mIoU) evaluation metric as a secondary metric to the part-category mIoU metric used in the main paper. The Shape mIoU metric takes shapes as evaluation units and measures how well an algorithm segments an average shape in the object category. In contrast, part-category mIoU reports the average performance over all part semantics and indicates how an algorithm performs for any given part category.

Shape mIoU is widely used on the ShapeNet Part dataset [43] for coarse 3D shape semantic segmentation [29, 30, 40, 24]. We propose a slightly different version for fine-grained semantic segmentation. For each test shape, we first compute the IoU for each part semantic that is either present in the ground truth or predicted by the algorithm, and then calculate the mean IoU for this shape. We remove ground-truth unlabeled points from the evaluation. Finally, we calculate the Shape mIoU by averaging the per-shape mIoU over all test shape instances; a computation sketch is shown below.
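A sketch of the Shape mIoU computation as described above, with our own helper names: per shape, IoUs are computed only for part categories that appear in the ground truth or the prediction, averaged per shape, and then averaged across shapes.

```python
import numpy as np

def shape_miou(pred_labels, gt_labels, num_parts):
    """pred_labels, gt_labels: lists of (N,) integer label arrays, one per test
    shape; label 0 marks unlabeled ground-truth points (excluded from scoring)."""
    per_shape = []
    for pred, gt in zip(pred_labels, gt_labels):
        mask = gt > 0
        pred, gt = pred[mask], gt[mask]
        ious = []
        for part in range(1, num_parts + 1):
            p, g = pred == part, gt == part
            if not p.any() and not g.any():
                continue                      # part absent in both: skip for this shape
            union = np.logical_or(p, g).sum()
            ious.append(np.logical_and(p, g).sum() / union)
        if ious:
            per_shape.append(np.mean(ious))   # mean IoU of this shape
    return float(np.mean(per_shape))
```

Unlike part_category_miou above, which accumulates IoUs per part category across the whole test set, this metric averages per shape first, so frequently occurring parts dominate the score.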

We benchmark the four algorithms using Shape mIoU in Table 7. Besides the Shape mIoU scores for each object category at each segmentation level, we also report the average across levels for each object category and further calculate the average over all object categories.

We observe that PointNet++ [30] achieves the best performance under the Shape mIoU metric, while PointCNN [24] performs best under the part-category mIoU metric. The part-category mIoU metric weights all part semantics equally, while the Shape mIoU metric weights all shapes equally. We observe unbalanced counts for different part semantics in most object categories, e.g. there are many more chair legs than chair wheels. To achieve a good part-category mIoU, a segmentation algorithm needs to perform equally well on both frequent and rare parts, while the Shape mIoU metric is biased toward frequently observed parts.

D.2 Hierarchical Semantic Segmentation

We describe the architecture and training details for the three baseline methods we propose for hierarchical semantic segmentation in the main paper. All three methods use the PointNet++ [30] segmentation network as the backbone. The three methods differ mainly in the training and inference strategies used to impose the tree structure on the final prediction.

The Bottom-up Method

The bottom-up method learns a network to perform segmentation at the most fine-grained leaf part semantics. We use the PointNet++ [30] segmentation network with a softmax activation layer as the network architecture. At inference time, we use the ground-truth tree hierarchy to gather the prediction for the parent nodes. The parent node prediction is the sum of all its children node predictions. Even though we only train for the leaf node parts, the parent history is implicitly encoded. For example, we define vertical bars for both chair back and chair arm, but they are two different leaf node parts: chair back vertical bar and chair arm vertical bar.

In the ground-truth annotation, every point in the point cloud belongs to the root node. Each point is assigned a path of labels from the root node down to some node in the tree. For most points this path reaches a leaf node, but for some it does not. For example, a point on a bed blanket (removed from evaluation since it cannot be distinguished without color information) may be assigned the labels {bed, bed unit, sleeping area} in the ground-truth annotation, and sleeping area is not a leaf part. To handle such cases, we introduce an additional leaf node other under each parent node in the tree and include these nodes during training.

The Top-down Method

The top-down method learns a multi-labeling task over all part semantics in the tree, considering both the leaf nodes and the parent nodes. Compared to the bottom-up method, the top-down method takes advantage of the tree structure at training time.

Assuming there are $K$ nodes in the hierarchy, we train a PointNet++ [30] segmentation network that performs a $K$-way classification for each point. We apply a softmax activation layer to enforce mutual exclusiveness among the labels. For a point with ground-truth label set $\mathcal{L}$ (the labels along its root-to-node path) and predicted softmax scores $\{s_1, \dots, s_K\}$, we train with a multi-labeling cross-entropy loss

$$\mathcal{L}_{\text{top-down}} = -\sum_{l \in \mathcal{L}} \log s_l \qquad (1)$$

which increases the predicted scores of all the ground-truth labels along the path (e.g. the three labels in the example above) relative to the remaining labels.
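A hedged PyTorch-style sketch of this loss is given below. The `path_mask` encoding of the ground-truth root-to-node path is our own assumption, and the normalization convention may differ from the exact loss used in our experiments.

```python
import torch
import torch.nn.functional as F

def multilabel_path_loss(logits, path_mask):
    """Cross-entropy against every label on a point's ground-truth path.

    logits:    (num_points, K) raw network outputs for the K tree nodes.
    path_mask: (num_points, K) float mask with 1 for every node on the
               point's ground-truth root-to-node path (assumed encoding).
    """
    log_probs = F.log_softmax(logits, dim=1)       # softmax over the K nodes
    # Negative log-likelihood summed over the ground-truth path labels,
    # averaged over points (Eq. (1), up to the normalization convention).
    return -(path_mask * log_probs).sum(dim=1).mean()
```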

 

     Avg Bag Bed Bott Bowl Chair Clock Dish Disp Door Ear Fauc Hat Key Knife Lamp Lap Micro Mug Frid Scis Stora Table Trash Vase

S1   72.5 62.8 38.7 76.7 83.2 91.5 41.5 81.4 91.3 71.2 81.4 82.2 71.9 23.2 78.0 60.3 100 76.2 94.3 60.6 74.9 55.0 80.1 76.1 87.1
S2   50.2 22.7 51.1 78.7 43.3 49.1 68.6 42.9 51.9 43.7
S3   50.2 17.5 66.5 42.3 40.7 59.3 83.9 29.0 60.2 61.6 55.0 37.6 53.7 30.6 45.1 37.8 50.0 82.0
Avg  64.2 62.8 26.3 71.6 83.2 61.6 41.1 73.1 87.6 47.8 70.8 71.9 71.9 23.2 66.5 49.0 100 66.2 94.3 44.7 74.9 50.7 53.8 63.0 84.6

O1   80.3 78.4 62.2 80.8 83.8 94.9 74.6 81.4 94.3 76.1 87.1 86.5 77.8 44.5 76.6 65.0 100 79.5 95.3 79.0 87.6 62.7 88.1 82.3 89.0
O2   60.5 29.4 64.7 75.4 61.1 56.8 78.2 61.7 57.4 59.4
O3   57.7 22.1 68.3 58.4 53.7 67.5 84.8 38.0 62.4 66.8 63.5 45.8 54.0 45.0 52.6 52.5 58.7 86.4
Avg  72.2 78.4 37.9 74.6 83.8 72.7 64.2 74.8 89.5 58.4 74.8 76.6 77.8 44.5 70.1 55.8 100 70.6 95.3 61.9 87.6 57.6 66.7 70.5 87.7

Table 9: Instance segmentation results (shape mAP %, IoU threshold 0.5). Algorithms S and O refer to SGPN [34] and our proposed method, respectively. The numbers 1, 2, and 3 refer to the three levels of segmentation: coarse-, middle-, and fine-grained. Entries for levels that are not defined for a category are omitted.

The Ensemble Method

The ensemble method trains multiple neural networks, one at each level of segmentation defined in the fine-grained semantic segmentation task. The key idea is that segmenting at the coarse, middle, and fine-grained levels separately may learn different features that work best at each level. Compared to the bottom-up method, which is trained only at the most fine-grained level, the additional supervision at the coarse level helps distinguish the coarse-level part semantics. For example, the local geometric features of chair back vertical bars and chair arm vertical bars may be very similar, but the coarse-level semantics can separate chair backs from chair arms more easily.

During training, we train a separate network for each defined segmentation level. At inference time, we perform joint inference over the prediction scores from all the networks using a path-voting strategy: for each path from the root node to a leaf node, we compute the average log-likelihood over the softmax prediction scores of the networks along the path, and select the path with the highest score as the joint label prediction.
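Below is a minimal sketch of this path-voting inference, assuming per-level softmax score arrays and an explicit enumeration of root-to-leaf paths in the template; the data layout and names are our own assumptions.

```python
import numpy as np

def path_voting(level_scores, paths):
    """Per point, pick the root-to-leaf path with the highest average
    log-likelihood across the per-level networks.

    level_scores: dict mapping a level id to a (num_points, num_labels)
                  array of softmax scores from that level's network.
    paths: list of root-to-leaf paths, each a list of (level, label) pairs
           describing the labels along the path (assumed encoding).
    Returns, for every point, the index of the winning path.
    """
    num_points = next(iter(level_scores.values())).shape[0]
    path_ll = np.zeros((num_points, len(paths)))
    for p, path in enumerate(paths):
        logs = [np.log(level_scores[lvl][:, lab] + 1e-12) for lvl, lab in path]
        path_ll[:, p] = np.mean(logs, axis=0)      # average log-likelihood on the path
    return np.argmax(path_ll, axis=1)
```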

Shape mIoU Metric and Results

Similar to Sec. D.1, we define Shape mIoU for hierarchical segmentation. The mIoU for each shape is computed over the part semantics in the entire hierarchical template that are either predicted by the network or included in the ground truth; unrelated parts are not taken into consideration. Table 8 shows the quantitative evaluation of the three baseline methods. The three methods perform similarly, with the ensemble method working slightly better.

D.3 Instance Segmentation

More Architecture and Training Details

To train our proposed method, we use batch size 32, learning rate 0.001, and the default batch normalization settings of PointNet++ [30].

For SGPN [34], we use the two-stage training suggested by its authors. We first pretrain the PointNet++ semantic segmentation branch with batch size 32 and learning rate 0.001, using the default batch normalization settings of PointNet++. We then jointly train the semantic segmentation, similarity score matrix, and confidence scores with batch size 1 and learning rate 0.0001. As suggested in the original SGPN paper, for the first five epochs of the joint training we only turn on the loss for the similarity score matrix; the remaining epochs are trained with all losses switched on. We have to use batch size 1 because the input point cloud contains 10,000 points, so the similarity score matrix is a 10,000 × 10,000 matrix, which occupies too much GPU memory. Our proposed method is more memory-efficient than SGPN, and we also observe that our training is much faster than SGPN. We train all the networks until convergence.

Shape mAP Metric and Results

We define the Shape mean Average-Precision (Shape mAP) metric as a secondary metric to the Part-category mAP metric in the main paper. Similar to the Shape mIoU scores used in Secs. D.1 and D.2, Shape mAP reports part instance segmentation performance on an average shape in an object category. It averages across test shapes, instead of across all part semantics as the Part-category mAP in the main paper does.

To calculate Shape mAP for a test shape, we consider the AP of each part semantics that occurs either in the ground truth or in the prediction for that shape and take their average as the per-shape mean AP. We then average this mAP across all test shapes within an object category. Table 9 reports the part instance segmentation performance under the Shape mAP metric. We see a clear performance improvement of the proposed method over SGPN.
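The averaging structure of Shape mAP can be summarized by the short sketch below; `instance_ap` is a hypothetical callable standing in for a standard detection-style AP at IoU 0.5, and the input layout is our own assumption.

```python
import numpy as np

def shape_map(per_shape_results, instance_ap):
    """Shape mAP: mean over test shapes of the per-shape mean AP.

    per_shape_results: one dict per test shape, mapping a part-semantics id
        to (predicted_instances, ground_truth_instances) for every semantics
        that occurs in either the prediction or the ground truth of that shape.
    instance_ap: callable returning AP at IoU 0.5 for one part semantics of
        one shape (a hypothetical stand-in for a standard AP computation).
    """
    shape_scores = []
    for shape in per_shape_results:
        aps = [instance_ap(pred, gt) for pred, gt in shape.values()]
        shape_scores.append(float(np.mean(aps)))
    return float(np.mean(shape_scores))
```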

Figure 13: Fine-grained instance-level segmentation visualization (1/2). We present visualizations of example fine-grained instance-level segmentation annotations for chair, bag, bed, bottle, bowl, clock, dishwasher, display, door, earphone, faucet, and hat.
Figure 14: Fine-grained instance-level segmentation visualization (2/2). We present visualizations of example fine-grained instance-level segmentation annotations for storage furniture, keyboard, knife, laptop, lamp, microwave, mug, refrigerator, scissors, table, trash can, and vase.
Figure 15: Hierarchical instance-level segmentation visualization (1/3). We present visualizations of example hierarchical instance-level segmentation annotations for bed, clock, storage furniture, faucet, table, and chair. The lamp examples are shown in the main paper. And-nodes are drawn with solid lines and Or-nodes with dashed lines.
Figure 16: Hierarchical instance-level segmentation visualization (2/3). We present visualizations of example hierarchical instance-level segmentation annotations for dishwasher, laptop, display, trash can, door (door set), earphone, vase (pot), and keyboard. The lamp examples are shown in the main paper. And-nodes are drawn with solid lines and Or-nodes with dashed lines.
Figure 17: Hierarchical instance-level segmentation visualization (3/3). We present visualizations of example hierarchical instance-level segmentation annotations for scissors, microwave, knife (cutting instrument), hat, bowl, bottle, mug, bag, and refrigerator. The lamp examples are shown in the main paper. And-nodes are drawn with solid lines and Or-nodes with dashed lines.
Figure 18: Template visualization (1/3). We present the templates for table and chair. The lamp template is shown in the main paper. And-nodes are drawn with solid lines and Or-nodes with dashed lines.
Figure 19: Template visualization (2/3). We present the templates for storage furniture, faucet, clock, bed, knife (cutting instrument), and trash can. The lamp template is shown in the main paper. And-nodes are drawn with solid lines and Or-nodes with dashed lines.
Figure 20: Template visualization (3/3). We present the templates for earphone, bottle, scissors, door (door set), display, dishwasher, microwave, refrigerator, laptop, vase (pot), hat, bowl, bag, mug, and keyboard. The lamp template is shown in the main paper. And-nodes are drawn with solid lines and Or-nodes with dashed lines.