Generative 3D Part Assembly via Dynamic Graph Learning

by   Jialei Huang, et al.
Peking University
Stanford University

Autonomous part assembly is a challenging yet crucial task in 3D computer vision and robotics. Analogous to buying a piece of IKEA furniture: given a set of 3D parts that can assemble into a single shape, an intelligent agent needs to perceive the 3D part geometry, reason to propose pose estimations for the input parts, and finally call robotic planning and control routines for actuation. In this paper, we focus on the pose estimation subproblem from the vision side, which involves geometric and relational reasoning over the input part geometry. Essentially, the task of generative 3D part assembly is to predict a 6-DoF part pose, including a rigid rotation and translation, for each input part, so that the posed parts assemble into a single 3D shape as the final output. To tackle this problem, we propose an assembly-oriented dynamic graph learning framework that leverages an iterative graph neural network as a backbone. It explicitly conducts sequential part assembly refinements in a coarse-to-fine manner, and exploits a pair of modules, a part relation reasoning module and a part aggregation module, for dynamically adjusting both part features and their relations in the part graph. We conduct extensive experiments and quantitative comparisons to three strong baseline methods, demonstrating the effectiveness of the proposed approach.


Generative 3D Part Assembly via Dynamic Graph Learning, NeurIPS 2020


1 Introduction

It is a complicated and laborious task, even for humans, to assemble a piece of IKEA furniture from its parts. Without procedural or external guidance, e.g. reading the instruction manual or watching a step-by-step video demonstration, the task of 3D part assembly involves exploring an extremely large solution space and reasoning over the input part geometry for candidate assembly proposals. To assemble physically stable furniture, a rich set of part relations and constraints needs to be satisfied.

There is a body of literature in the computer vision and graphics fields that studies part-based 3D shape modeling and synthesis. For example, Chaudhuri and Koltun (2010); Shen et al. (2012); Sung et al. (2017) employ a third-party repository of 3D meshes for part retrieval to assemble a complete shape. Benefiting from recent large-scale part-level datasets Mo et al. (2019b); Yi et al. (2016) and the advent of deep learning techniques, some recent works Li et al. (2020a); Schor et al. (2019); Wu et al. (2020) leverage deep neural networks to sequentially generate part geometry and posing transforms for shape composition. Though similar to our task, none of these works addresses exactly the same setting as ours: they either allow free-form generation of the part geometry, or assume certain part priors, such as a known number of parts, known part semantics, or an available large part pool. In our setting, we assume no semantic knowledge about the input parts and assemble 3D shapes conditioned on a given set of fine-grained part geometries, with a variable number of parts across shape instances.

In this paper, we propose a dynamic graph learning framework that predicts a 6-DoF part pose, including a rigid rotation and translation, for each input part point cloud by forming a dynamically varying part graph and iteratively reasoning over the part poses and their relations. We employ an iterative graph neural network to gradually refine the part pose estimations in a coarse-to-fine manner. At the core of our method, we propose a dynamic part relation reasoning module and a dynamic part aggregation module that jointly learn to dynamically evolve node and edge features within the part graph.

Lacking real-world data for 3D part assembly, we train and evaluate the proposed approach on the synthetic PartNet dataset, which provides a large-scale benchmark with ground-truth part assemblies for ShapeNet models at a fine-grained part granularity. Although no previous work studies exactly the same problem setting as ours, we formulate three strong baselines inspired by previously published works on similar task domains and demonstrate that our method outperforms them by significant margins.

Diagnostic analysis further indicates that, in the iterative part assembly procedure, a set of central parts (e.g. chair back, chair seat) learns much faster than the peripheral parts (e.g. chair legs, chair arms) and quickly sketches out the shape backbone in the first several iterations. The peripheral parts then gradually adjust their poses to match the central part poses via the graph message-passing mechanism. Such dynamic behaviors emerge automatically without direct supervision, demonstrating the effectiveness of our dynamic graph learning framework.

2 Related Work

Assembly-based 3D modeling. Part assembly is an important task in many fields. Recent works in the robotics community Litvak et al. (2019); Zakka et al. (2019); Shao et al. (2019); Luo et al. (2019) emphasize planning, in-hand manipulation and robot grasping from a partial RGB-D observation in an active learning manner, while our work shares more similarity with work in the vision and graphics communities, which focuses on the problem of pose or joint estimation for part assembly. On this side, Funkhouser et al. (2004) is the pioneering work that constructs 3D geometric surface models by assembling parts of interest from a repository of 3D meshes. The follow-ups Chaudhuri et al. (2011); Kalogerakis et al. (2012); Jaiswal et al. (2016) learn probabilistic graphical models that encode semantic and geometric relationships among shape components to explore part-based shape modeling. Chaudhuri and Koltun (2010); Shen et al. (2012); Sung et al. (2017) model a 3D shape in an assembly manner, conditioned on a single-view scan, rough models created by artists, or a partial shape.

However, most of these previous works rely on a third-party shape repository to query parts for the assembly. Inspired by recent generative deep learning techniques and benefiting from large-scale annotated object part datasets Mo et al. (2019b); Yi et al. (2016), some recent works Li et al. (2020a); Schor et al. (2019); Wu et al. (2020) generate the parts and then predict a per-part transformation to compose the shape. Dubrovina et al. (2019) introduces a Decomposer-Composer network for a novel factorized shape latent space. These existing data-driven approaches mostly focus on creating novel shapes from accumulated shape priors, and restrict the estimated transformation parameters to translation and scale: they assume object parts are already well rotated in the object canonical space. In this work, we focus on a more practical problem setting, similar to assembling parts into a piece of IKEA furniture, where all the parts are provided laid out on the ground in their part canonical spaces. Our goal of part assembly is to estimate the part-wise 6-DoF pose, rotation and translation, that composes the parts into a complete shape (furniture). A recent work Li et al. (2020b) has a similar setting but requires an input image as guidance.

Structure-aware generative networks. Deep generative models, such as generative adversarial networks (GANs) Goodfellow et al. (2014) and variational autoencoders (VAEs) Kingma and Welling (2014), have recently been explored for shape generation tasks. Li et al. (2017a); Mo et al. (2019a) propose hierarchical generative networks to encode structured models represented as abstracted bounding boxes. The follow-up work Mo et al. (2020a) extends the learned structural variations to conditional shape editing. Gao et al. (2019) introduces a two-level VAE to jointly learn the global shape structure and fine part geometries. Wu et al. (2019) proposes a two-branch generative network that exchanges information between structure and geometry for 3D shape modeling. Wang et al. (2018) presents a global-to-local adversarial network to construct the overall structure of the shape, followed by a conditional autoencoder for part refinement. Recently, Mo et al. (2020b) employs a conditional GAN to generate a point cloud from an input rough shape structure. Most of the aforementioned works couple shape structure and geometry in joint learning for diverse and perceptually plausible 3D modeling. In contrast, we focus on a more challenging problem that aims at generating shapes with only structural variations, conditioned on fixed, detailed part geometry.

3 Assembly-Oriented Dynamic Graph Learning

Given a set of 3D part point clouds $\mathcal{P} = \{p_i\}_{i=1}^{N}$ as inputs, where $N$ denotes the number of parts and may vary for different shapes, the goal of our task is to predict a 6-DoF part pose $q_i = (r_i, t_i)$ for each input part $p_i$ and form a final part assembly for a complete 3D shape $S = \{q_i(p_i)\}_{i=1}^{N}$, where $q_i(p_i)$ denotes the part point cloud $p_i$ transformed according to $q_i$. The input parts may come in many geometrically-equivalent groups, e.g. four chair legs or two chair arms, where the parts in each group share the same part geometry, and we assume the part count in each group is known.
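Concretely, a predicted pose acts on its part point cloud as a quaternion rotation followed by a translation. The minimal numpy sketch below illustrates this output representation; the helper names (`quat_to_matrix`, `apply_pose`) are illustrative and not from the paper's code.

```python
import numpy as np

def quat_to_matrix(r):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = r / np.linalg.norm(r)  # re-normalize for numerical safety
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def apply_pose(points, r, t):
    """Rotate then translate an (M, 3) part point cloud."""
    return points @ quat_to_matrix(r).T + t

part = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
r90z = np.array([np.cos(np.pi/4), 0.0, 0.0, np.sin(np.pi/4)])  # 90 deg about z
moved = apply_pose(part, r90z, t=np.array([0.0, 0.0, 1.0]))
```

The unit-norm constraint on the quaternion keeps the predicted rotation rigid, matching the 3-DoF rotation parameterization described later in Section 3.1.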

To tackle this problem, we propose an assembly-oriented dynamic graph learning framework that leverages an iterative graph neural network (GNN) as a backbone, which explicitly conducts sequential part assembly refinements in a coarse-to-fine manner and exploits a part relation reasoning module and a part aggregation module for iteratively adjusting part features and their relations in the part graph. Figure 1 illustrates the proposed pipeline. Below, we first introduce the iterative GNN backbone and then discuss the dynamic part relation reasoning module and the dynamic part aggregation module in detail.

3.1 Iterative Graph Neural Network Backbone

We represent the dynamic part graph at every time step $t$ as a self-looped directed graph $\mathcal{G}^{(t)} = (\mathcal{V}^{(t)}, \mathcal{E}^{(t)})$, where $\mathcal{V}^{(t)}$ is the set of nodes and $\mathcal{E}^{(t)}$ is the set of edges in $\mathcal{G}^{(t)}$. We treat each part $p_i$ as a node in the graph and initialize its attribute by encoding the part geometry as $v_i^{(0)} = f_{\text{init}}(p_i)$, where $f_{\text{init}}$ is a parametric function implemented as a vanilla PointNet Qi et al. (2017) that extracts a global permutation-invariant feature summarizing the input part point cloud $p_i$. We use a shared PointNet to process all the parts.

We use a fully connected graph, drawing an edge between every pair of parts, and perform iterative graph message-passing by alternating between updating the edge and node features. Specifically, the edge attribute emitted from node $j$ to node $i$ at time step $t$ is calculated as a neural message

$$e_{ji}^{(t)} = f_{\text{edge}}^{(t)}\big(v_j^{(t)},\, v_i^{(t)}\big), \tag{1}$$

which is then leveraged to update the node attribute at the next time step by aggregating messages from all the other nodes,

$$v_i^{(t+1)} = f_{\text{node}}^{(t)}\Big(v_i^{(t)},\; \frac{1}{N-1}\sum_{j \neq i} e_{ji}^{(t)}\Big), \tag{2}$$

which takes both the previous node attribute and the averaged message among neighbors as inputs. The part pose $q_i^{(t+1)} = (r_i^{(t+1)}, t_i^{(t+1)})$, including a 3-DoF rotation represented as a unit 4-dimensional quaternion $r_i^{(t+1)}$ and a 3-dimensional translation vector $t_i^{(t+1)}$ denoting the part center offset, is then regressed by decoding the updated node attribute via

$$q_i^{(t+1)} = f_{\text{pose}}^{(t)}\big(v_i^{(t+1)},\, v_i^{(0)},\, q_i^{(t)}\big). \tag{3}$$

Besides the node feature $v_i^{(t+1)}$ at the current time step, $f_{\text{pose}}^{(t)}$ also takes as input the initial node attribute $v_i^{(0)}$ to capture the raw part geometry information, and the pose $q_i^{(t)}$ estimated at the last time step for more coherent pose evolution. Note that $q_i^{(0)}$ is not defined and hence not inputted to $f_{\text{pose}}^{(0)}$ at the first time step.

In our implementation, $f_{\text{edge}}^{(t)}$, $f_{\text{node}}^{(t)}$ and $f_{\text{pose}}^{(t)}$ are all parameterized as Multi-Layer Perceptrons (MLPs) that are shared across all the edges or nodes within each time step. Note that we use different network weights for different iterations, since the node and edge features evolve over time and may contain information at different scales. Our iterative graph neural network runs for 5 iterations and learns to sequentially refine the part assembly in a coarse-to-fine manner.
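A single message-passing iteration of the edge/node/pose updates above can be sketched in a few lines of numpy. The tiny linear "MLPs", dimensions, and variable names below are toy placeholders, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 4, 8                                  # number of parts, feature dim
v = rng.normal(size=(N, D))                  # node features at time step t
W_edge = rng.normal(size=(2 * D, D)) * 0.1   # stand-in for the edge MLP
W_node = rng.normal(size=(2 * D, D)) * 0.1   # stand-in for the node MLP
W_pose = rng.normal(size=(D, 7)) * 0.1       # stand-in for the pose MLP

# Edge update: a neural message from node j to node i
e = np.zeros((N, N, D))
for i in range(N):
    for j in range(N):
        e[j, i] = np.tanh(np.concatenate([v[j], v[i]]) @ W_edge)

# Node update: previous node feature plus the averaged incoming messages
v_next = np.zeros_like(v)
for i in range(N):
    msg = np.mean([e[j, i] for j in range(N) if j != i], axis=0)
    v_next[i] = np.tanh(np.concatenate([v[i], msg]) @ W_node)

# Pose decoding (simplified): a 7-D output, 4-D quaternion + 3-D translation
pose = v_next @ W_pose
quat = pose[:, :4] / np.linalg.norm(pose[:, :4], axis=1, keepdims=True)
```

In the full model each iteration uses its own network weights, and the pose decoder additionally consumes the initial node feature and the previous pose estimate, which this sketch omits.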

Figure 1: The proposed dynamic graph learning framework. The iterative graph neural network backbone takes a set of part point clouds as inputs and conducts 5 iterations of graph message-passing for coarse-to-fine part assembly refinements. The graph dynamics is encoded in two folds: (a) reasoning the part relations (graph structure) from the part pose estimation, which in turn also evolves with the updated part relations, and (b) alternately updating the node set by aggregating all the geometrically-equivalent parts (the red and purple nodes), e.g. two chair arms, into a single node (the yellow node) to perform graph learning on a sparse node set at even time steps, and unpooling these nodes back to the dense node set at odd time steps. Note that the semi-transparent nodes and edges are not included in the graph learning of certain time steps.

3.2 Dynamic Relation Reasoning Module

The relationship between entities is known to be important for a better understanding of visual data, and various relation graphs have been defined in the literature. Xu et al. (2017); Li et al. (2017b); Yang et al. (2018); Chen et al. (2019) learn scene graphs from the labeled object relationships in the large-scale Visual Genome image dataset Krishna et al. (2017), in favor of the 2D object detection task. Li et al. (2019); Zhou et al. (2019); Wang et al. (2019a); Ritchie et al. (2019) calculate statistical relationships between objects via geometric heuristics for the 3D scene generation problem. In terms of shape understanding, Mo et al. (2019a); Gao et al. (2019) define the relation as the adjacency or symmetry between every two parts in the full shape.

In our work, we learn dynamically evolving part relationships for the task of part assembly. Different from many previous dynamic graph learning algorithms Wang et al. (2019b); Zhang et al. (2020) that only evolve the node and edge features implicitly, we propose to explicitly update the relation graph based on the estimated part assembly at each time step. This design is specific to our part assembly task and incorporates an assembly flavor into the model. At each time step, we predict a pose for each part, and the resulting temporary part assembly enables explicit reasoning about how to refine the part poses in the next step, considering the current part disconnections and the overall shape geometry.

To achieve this goal, besides the maintained edge attributes, we also learn to reason a directed edge-wise weight scalar $w_{ji}^{(t)}$ that indicates the significance of the relation from node $j$ to node $i$. Then, we update the node attribute at time step $t+1$ via a weighted aggregation of the incoming edge attributes,

$$v_i^{(t+1)} = f_{\text{node}}^{(t)}\Big(v_i^{(t)},\; \frac{\sum_{j \neq i} w_{ji}^{(t)}\, e_{ji}^{(t)}}{\sum_{j \neq i} w_{ji}^{(t)}}\Big). \tag{4}$$
There are various options to implement the relation reasoning. For example, one can employ the part geometry transformed by the estimated poses to regress the relation, or incorporate a holistic assembled-shape feature. In our implementation, however, we find that directly exploiting the pose features to learn the relation is already satisfactory. This is mainly because parts of different semantics may share similar geometries but usually have different poses. For instance, a leg and a leg stretcher are geometrically similar but placed and oriented very differently. Therefore, we adopt the simple solution of reasoning the relation only from the estimated poses,


$$w_{ji}^{(t)} = f_{\text{rel}}^{(t)}\big(f_{\text{feat}}(q_j^{(t)}),\, f_{\text{feat}}(q_i^{(t)})\big), \tag{5}$$

where both $f_{\text{rel}}^{(t)}$ and $f_{\text{feat}}$ are parameterized as MLPs. $f_{\text{feat}}$ is used to extract an independent pose feature from each part pose prediction. Note that we set $w_{ji}^{(0)} = 1$ in the beginning.
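The relation-weighted aggregation described above can be sketched as follows. The sigmoid used to keep the weights positive, the toy linear "MLPs", and all names are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 4, 8
poses = rng.normal(size=(N, 7))            # current pose estimates (quat + trans)
e = rng.normal(size=(N, N, D))             # edge features e[j, i] from j to i
W_feat = rng.normal(size=(7, D)) * 0.1     # stand-in for the pose-feature MLP
W_rel = rng.normal(size=(2 * D,)) * 0.1    # stand-in for the relation MLP

pf = np.tanh(poses @ W_feat)               # per-part pose features

# Directed relation weight from node j to node i, regressed from pose features
w = np.zeros((N, N))
for j in range(N):
    for i in range(N):
        logit = np.concatenate([pf[j], pf[i]]) @ W_rel
        w[j, i] = 1.0 / (1.0 + np.exp(-logit))   # sigmoid keeps weights positive

# Weighted mean of incoming messages replaces the plain average
agg = np.zeros((N, D))
for i in range(N):
    idx = [j for j in range(N) if j != i]
    weights = w[idx, i]
    agg[i] = (weights[:, None] * e[idx, i]).sum(axis=0) / weights.sum()
```

The key point is that the graph structure itself (the weights `w`) is re-estimated at every time step from the current pose predictions, rather than being fixed or evolving only implicitly.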

3.3 Dynamic Part Aggregation Module

We observe that geometrically-equivalent parts are highly correlated and thus very likely to share common knowledge regarding part poses and relationships. For example, four long sticks may all serve as legs that stand upright on the ground, and two leg stretchers may be parallel to each other. Thus, we propose a dynamic part aggregation module that allows more direct information exchange among geometrically-equivalent parts.

To achieve this goal, we explicitly create two sets of nodes at different assembly levels: a dense node set including all the part nodes, and a sparse node set created by aggregating all the geometrically-equivalent nodes into a single node. Then, we perform graph learning by alternately updating the relation graph over the dense and sparse node sets. In this manner, we allow dynamic communication among geometrically-equivalent parts for synchronizing shared information, while still learning to diverge to different part poses.

In our implementation, we denote $\mathcal{G}_d^{(t)}$ as the dense node graph at time step $t$. To create a sparse node graph $\mathcal{G}_s^{(t+1)}$ at time step $t+1$, we first aggregate the node attributes among each group $g$ of geometrically-equivalent parts via max-pooling into a single node,

$$v_g^{(t)} = \max_{i \in g}\, v_i^{(t)}. \tag{6}$$

Then, we aggregate the relation weights emitted from the geometrically-equivalent parts in $g$ to any other node $j$,

$$w_{gj}^{(t)} = \frac{1}{|g|} \sum_{i \in g} w_{ij}^{(t)}. \tag{7}$$

The inverse relation emitted from node $j$ to the aggregated node $g$ is computed similarly as

$$w_{jg}^{(t)} = \frac{1}{|g|} \sum_{i \in g} w_{ji}^{(t)}. \tag{8}$$

These aggregation steps are conducted once we finish updating the dense node graph, after which we can operate on the sparse node graph following Eq. (4) and (5). To enrich the sparse node set back to the dense node set, we simply unpool the node features to the corresponding sets of geometrically-equivalent parts. We alternately conduct dynamic graph learning over the dense and sparse node sets at odd and even iterations, respectively.
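The dense-to-sparse pooling and the unpooling back can be sketched as below; the group lists, feature sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
feats = rng.normal(size=(6, 8))        # dense node features for 6 parts
groups = [[0], [1], [2, 3, 4, 5]]      # e.g. back, seat, and four equivalent legs

# Aggregate each group into one sparse node via element-wise max-pooling
sparse = np.stack([feats[g].max(axis=0) for g in groups])

# ... sparse-graph message passing would update `sparse` here ...

# Unpooling: copy each sparse node feature back to all of its group members
dense = np.zeros_like(feats)
for g, s in zip(groups, sparse):
    dense[g] = s
```

After unpooling, all members of a group start the next dense iteration from the same synchronized feature, and subsequent dense-graph message passing lets their poses diverge again.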

3.4 Training and Losses

Given an input set of part point clouds, there may be multiple valid solutions for the shape assembly. For example, one can move a leg stretcher up and down as long as it stays connected to two parallel legs, and the chair back may be laid down to form a deck chair. To address such multi-modal predictions, we employ the Min-of-N (MoN) loss Fan et al. (2017) to balance between assembly quality and diversity. Let $f$ denote our whole framework, which takes in the part point cloud set $\mathcal{P}$ and a random noise vector $z_j$ sampled from the unit Gaussian distribution $\mathcal{N}(0, 1)$. Let $\mathcal{L}$ be any loss function supervising the network output $f(\mathcal{P}, z_j)$, and let $S^{*}$ be the ground truth sample provided in the dataset; then the MoN loss is defined as

$$\mathcal{L}_{\text{MoN}} = \min_{j = 1, \dots, n}\; \mathcal{L}\big(f(\mathcal{P}, z_j),\, S^{*}\big), \qquad z_j \sim \mathcal{N}(0, 1). \tag{9}$$

The MoN loss encourages at least one of the predictions to be close to the ground truth, which is more tolerant of a dataset with limited diversity and hence more suitable for our problem. In practice, we sample $n = 5$ particles of $z$ to approximate Eq. (9).
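The MoN objective of Eq. (9) amounts to keeping only the best of several noise-conditioned predictions. In this sketch, `assemble` is a placeholder stand-in for the full framework $f$, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(3)

def assemble(parts, z):
    """Placeholder network: perturbs a fixed prediction with the noise sample."""
    return parts + z

def loss(pred, gt):
    """Placeholder supervision: mean squared error between point sets."""
    return float(np.mean((pred - gt) ** 2))

def mon_loss(parts, gt, n=5):
    """Min-of-N: evaluate n noise samples and keep the best prediction's loss."""
    zs = rng.normal(size=(n,) + parts.shape)
    return min(loss(assemble(parts, z), gt) for z in zs)

parts = np.zeros((10, 3))
gt = np.ones((10, 3))
best = mon_loss(parts, gt, n=5)
```

During training, gradients flow only through the minimizing sample, so the network is free to use the other noise samples for diverse alternative assemblies.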

The loss $\mathcal{L}$ is implemented as a weighted combination of local part and global shape losses, detailed below. Each part pose $q_i$ can be decomposed into a rotation $R_i$ and a translation $t_i$. We supervise the translation via an $\ell_2$ loss,

$$\mathcal{L}_t = \sum_{i=1}^{N} \big\| t_i - t_i^{*} \big\|_2^2. \tag{10}$$

The rotation is supervised via the Chamfer distance $d_{\text{CD}}$ on the rotated part point cloud,

$$\mathcal{L}_r = \sum_{i=1}^{N} d_{\text{CD}}\big(R_i(p_i),\, R_i^{*}(p_i)\big). \tag{11}$$

In order to achieve good assembly quality holistically, we also supervise the full shape assembly using the Chamfer distance (CD),

$$\mathcal{L}_s = d_{\text{CD}}\big(S,\, S^{*}\big). \tag{12}$$
In all equations above, the asterisk symbols denote the corresponding ground-truth values.
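A plain-numpy sketch of these supervision terms follows, assuming the common squared-distance form of the symmetric Chamfer distance; the relative weighting of the combined loss is not sketched here.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between two (M, 3) point sets."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1) ** 2
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def translation_loss(t_pred, t_gt):
    """Squared l2 loss on a predicted part-center translation."""
    return float(np.sum((t_pred - t_gt) ** 2))

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
assert chamfer(pts, pts) == 0.0  # identical point sets have zero Chamfer distance
```

The same `chamfer` routine serves both the per-part rotation supervision (on the rotated part point cloud) and the holistic shape supervision (on the full assembly).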

4 Experiments and Analysis

We conduct extensive experiments demonstrating the effectiveness of the proposed method and show quantitative and qualitative comparisons to three baseline methods. We also provide diagnostic analysis over the learned part relation dynamics, which clearly illustrates the iterative coarse-to-fine refinement procedure.

4.1 Dataset

We leverage the recent PartNet Mo et al. (2019b), a large-scale shape dataset with fine-grained and hierarchical part segmentations, for both training and evaluation. We use the three largest categories, chairs, tables and lamps, and adopt its default train/validation/test splits in the dataset. In total, there are 6,323 chairs, 8,218 tables and 2,207 lamps. We deal with the most fine-grained level of PartNet segmentation. We use Furthest Point Sampling (FPS) to sample 1,000 points for each part point cloud. All parts are zero-centered and provided in the canonical part space computed using PCA.

4.2 Baseline Approaches

Since our task is novel, there is no directly comparable baseline method. However, we compare to three baselines inspired by previous works that share a similar spirit of part-based shape modeling or synthesis.

B-Complement: ComplementMe Sung et al. (2017) studies the task of synthesizing 3D shapes from a large repository of parts and mostly focuses on retrieving part candidates from the part database. We adapt the setting to our case by limiting the part repository to the input part set and sequentially predicting a part pose for each part.

B-LSTM: Instead of leveraging a graph structure to encode and decode part information jointly, we use a bidirectional LSTM module similar to PQ-Net Wu et al. (2020) to sequentially estimate the part pose. Note that the original PQ-Net studies the task of part-aware shape generative modeling, which is a quite different task from ours.

B-Global: Without the iterative GNN, we directly use the per-part feature, augmented with the global shape descriptor, to regress the part pose in one shot. Though dealing with different tasks, this baseline borrows a similar network design from CompoNet Schor et al. (2019) and PAGENet Li et al. (2020a).

                 Shape Chamfer Distance        Part Accuracy (%)          Connectivity Accuracy (%)
                 Chair    Table    Lamp        Chair    Table    Lamp     Chair    Table    Lamp
B-Global         0.0146   0.0112   0.0079      15.70    15.37    22.61     9.90    33.84    18.60
B-LSTM           0.0131   0.0125   0.0077      21.77    28.64    20.78     6.80    22.56    14.05
B-Complement     0.0241   0.0298   0.0150       8.78     2.32    12.67     9.19    15.57    26.56
Ours             0.0091   0.0050   0.0093      39.00    49.51    33.33    23.87    39.96    41.70

Table 1: Quantitative comparison between our approach and the baseline methods.

4.3 Evaluation Metrics

We use the Minimum Matching Distance (MMD) Achlioptas et al. (2018) to evaluate the fidelity of the assembled shapes: conditioned on the same input set of parts, we generate multiple shapes sampled from different Gaussian noises, and measure the minimum distance between the ground truth and the assembled shapes. We adopt three metrics: part accuracy and shape Chamfer distance following Li et al. (2020b), and connectivity accuracy proposed by us. The part accuracy is defined as

$$\text{PA} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\Big[ d_{\text{CD}}\big(q_i(p_i),\, q_i^{*}(p_i)\big) < \tau_p \Big], \tag{13}$$

where $\tau_p$ is a fixed CD threshold. Intuitively, it indicates the percentage of parts that match the ground truth parts up to this threshold. The shape Chamfer distance is calculated the same as in Eq. (12).

Connectivity Accuracy. The part accuracy measures assembly performance by considering each part separately. In this work, we propose the connectivity accuracy to further evaluate how well the parts are connected in the assembled shape. For each connected part pair $\langle p_i, p_j \rangle$ in the object space, we first select the point $c_{ij}$ in part $p_i$ that is closest to part $p_j$ as $p_i$'s contact point with respect to $p_j$, and then select the point $c_{ji}$ in $p_j$ that is closest to $c_{ij}$ as the corresponding contact point of $p_j$. Given the predefined contact point pairs located in the object space, we transform each contact point into its corresponding canonical part space. Then we calculate the connectivity accuracy of an assembled shape as

$$\text{CA} = \frac{1}{|\mathcal{C}|} \sum_{\langle c_{ij},\, c_{ji} \rangle \in \mathcal{C}} \mathbb{1}\Big[ \big\| q_i(c_{ij}) - q_j(c_{ji}) \big\|_2^2 < \tau_c \Big], \tag{14}$$

where $\mathcal{C}$ denotes the set of contact point pairs and $\tau_c$ is a fixed distance threshold. It evaluates the percentage of correctly connected part pairs.
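The part accuracy metric described above can be sketched as below; the threshold value used here is illustrative rather than the paper's, and `chamfer` assumes the standard squared-distance form.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between two (M, 3) point sets."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1) ** 2
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def part_accuracy(pred_parts, gt_parts, tau=0.01):
    """Fraction of posed parts within a Chamfer-distance threshold of the GT."""
    hits = [chamfer(p, g) < tau for p, g in zip(pred_parts, gt_parts)]
    return sum(hits) / len(hits)

gt = [np.zeros((4, 3)), np.ones((4, 3))]
pred_good = [np.zeros((4, 3)), np.ones((4, 3))]        # both parts match
pred_half = [np.zeros((4, 3)), np.ones((4, 3)) + 1.0]  # second part off by 1
acc_good = part_accuracy(pred_good, gt)
acc_half = part_accuracy(pred_half, gt)
```

Connectivity accuracy follows the same thresholding idea but is evaluated on pairs of posed contact points rather than on whole parts.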

Figure 2: Qualitative Results. Left: visual comparisons between our algorithm and the baseline methods; Right: multiple plausible assembly results generated by our network.
                                     Shape Chamfer Distance   Part Accuracy (%)   Connectivity Accuracy (%)
Our backbone w/o graph learning      0.0086                   26.05               28.07
Our backbone                         0.0055                   42.09               35.87
Our backbone w. relation reasoning   0.0052                   46.85               38.60
Our full algorithm                   0.0050                   49.51               39.96

Table 2: Experiments demonstrate that all the three key components are necessary.

4.4 Results and Comparisons

We present the quantitative comparisons with the baselines in Table 1. Our algorithm outperforms all these approaches by a significant margin in most columns, especially on the part and connectivity accuracy metrics. The visual results in Figure 2 (left) likewise show that the best assembly results are achieved by our algorithm, while the baseline methods usually fail to produce well-structured shapes. We also show multiple assembled shapes in Figure 2 (right), obtained by sampling different Gaussian noises as inputs: some bar-shaped parts are assembled into different positions to form objects of different structures.

Figure 3: Dynamically evolving part relation weights among four common chair part types. The orange cells highlight the four directed edges with the maximal learned relation weight in the matrix, while the yellow cells indicate the minimal ones. The vertical axis denotes the emitting parts, and the horizontal axis denotes the receiving parts.
Figure 4: The time-varying part assembly results (panels show Steps 1-5, followed by the ground truth).

We also ablate the three key components of our method: the iterative GNN, the dynamic part relation reasoning module and the dynamic part aggregation module. The results on the PartNet Table category are shown in Table 2, where our full model achieves the best performance compared to the ablated versions. We first justify the effectiveness of our backbone by replacing the graph learning module with a multi-layer perceptron that estimates the part-wise pose from the concatenated separate and overall part features. We then incorporate the dynamic graph modules into our backbone for evaluation. We observe that both the proposed backbone and the dynamic graph modules contribute significantly to the final performance.

Figure 5: The part accuracy and its relative improvement summarized at each time step.

4.5 Dynamic Graph Analysis

Figure 3 summarizes the learned relation weights at each time step, averaged over all PartNet chairs, for four common part types: back, seat, leg and arm. We clearly see similar statistical patterns across the even iterations and across the odd ones. At even iterations, the relation graph is updated from the dense node set; it focuses more on passing messages from the central parts (i.e. seat, back) to the peripheral parts (i.e. leg, arm) and overlooks the relations between legs. At odd iterations, the relation graph is updated from the sparse node set, where geometrically-equivalent parts such as legs are aggregated into a single node; in this case, the relation graph shows that all the parts are influenced relatively more by the minor parts. On average, the central parts have bigger emitting relation weights than the peripheral parts, indicating that the central parts guide the assembly process more.

We further illustrate the part accuracy and its associated improvement at each time step in Figure 5. We find that the central parts are consistently predicted more accurately than the peripheral parts. Interestingly, the improvement of the peripheral parts is relatively higher than that of the central parts at even iterations, supporting the observation that the central parts guide the pose predictions for the peripheral parts. Figure 4 visualizes the time-varying part assembly results, showing that the poses of the central parts are determined first, and then the peripheral parts gradually adjust their poses to match the central parts. The results finally converge to stable part assembly predictions.

5 Conclusion

In this work, we propose a novel dynamic graph learning algorithm for the part assembly problem. We develop an iterative graph neural network backbone that learns to dynamically evolve node and edge features within the part graph, augmented with a dynamic relation reasoning module and a dynamic part aggregation module. Through thorough experiments and analysis, we demonstrate that the proposed method achieves state-of-the-art performance over the three baseline methods by learning an effective assembly-oriented relation graph. Future work may investigate better part assembly generative models that consider part joint information and higher-order part relations.


This work was supported by the start-up research funds from Peking University (7100602564) and the Center on Frontiers of Computing Studies (7100602567). We would also like to thank the Imperial Institute of Advanced Technology for GPU support.

6 Appendix

This document provides supplemental material that could not be included in the main paper due to the page limit:

  • Additional ablation study.

  • Analysis of dynamic graph on additional parts and object categories.

  • Training details.

  • Failure cases and future work.

  • Additional results of structural variation.

  • Additional qualitative results.

A. Additional ablation study

In this section, we demonstrate the effectiveness of the different components. We test our framework with the following variants, whose results are reported in Table 3.

  • Our backbone w/o graph learning: Replacing the graph learning module with a multi-layer perceptron to estimate the part-wise pose from the concatenated separate and overall part features.

  • Our backbone: Only the iterative graph learning module.

  • Our backbone + relation reasoning: Incorporate the dynamic relation reasoning module to our backbone.

  • Our backbone + part aggregation: Incorporate the dynamic part aggregation module to our backbone.

  • Exchange dense/sparse node set iteration: Switching the node set order in our algorithm. Specifically, we learn over the dense and sparse node sets at even and odd steps, respectively.

  • Input GT adjacency relation: Instead of learning the relation weights from the output poses, we replace the dynamically-evolving relation weights with static ground truth adjacency relation between two parts, which is a binary value. Note the ground truth relation only covers the adjacency, not the symmetry and any other type of relations.

  • Reasoning relation from geometry: We modify the dynamic relation reasoning module by replacing the input pose information in Equation 5 of the paper with part point cloud transformed by the estimated pose.

From Table 3, we observe that the proposed iterative GNN backbone, dynamic relation reasoning module and dynamic part aggregation module all contribute significantly to the assembly quality. We also exchange the order of the sparse and dense node sets in the iterative graph learning process and do not observe much difference compared to our full algorithm. To justify our learned relation weights, we replace them with the ground truth binary adjacency relations and observe much worse performance than with the learned relations. Finally, instead of learning the relation from the estimated poses as in Equation 5 of the main paper, we replace the poses with the transformed part point clouds, and also observe degraded performance, as analyzed in the paper.

                                           Shape CD   PA (%)   CA (%)
Our backbone w/o graph learning            0.0086     26.05    28.07
Our backbone                               0.0055     42.09    35.87
Our backbone + relation reasoning          0.0052     46.85    38.60
Our backbone + part aggregation            0.0051     48.01    38.13
Exchange dense/sparse node set iteration   0.0052     49.19    39.62
Input GT adjacency relation                0.0053     45.43    35.66
Reasoning relation from geometry           0.0053     45.11    39.21
Our full algorithm                         0.0050     49.51    39.96

Table 3: Ablation study demonstrating the effectiveness of each component of our algorithm. Shape CD, PA and CA are short for Shape Chamfer Distance, Part Accuracy and Connectivity Accuracy, respectively.

B. Additional analysis of the dynamic graph

We show additional learned relation weights in Figure 6. For the chair category, we extend the four parts shown in the main paper to eight, and observe that the central parts (back, seat, head) have larger emitted relation weights while the peripheral parts (regular_leg, arm, footrest, pedestal_base, star_leg) have larger received weights. This supports the observation in the main paper that the central parts guide the assembly process. A similar phenomenon can be observed in the lamp and table categories, for which we show only four parts because few common parts exist in the dataset.
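The emitted/received statistics visualized in Figure 6 can be computed from a relation-weight matrix in a few lines. A sketch, assuming `rel[i, j]` holds the weight part i emits to part j:

```python
import numpy as np

def emitted_received(rel):
    """Average relation weight each part emits to / receives from all
    other parts, excluding the diagonal (self-relations)."""
    n = rel.shape[0]
    off = ~np.eye(n, dtype=bool)                       # off-diagonal mask
    emitted = np.where(off, rel, 0.0).sum(axis=1) / (n - 1)
    received = np.where(off, rel, 0.0).sum(axis=0) / (n - 1)
    return emitted, received

# Toy example: part 0 plays the role of a "central" part that emits
# strong relations, while parts 1 and 2 mostly receive them.
rel = np.array([[0.0, 0.9, 0.8],
                [0.1, 0.0, 0.2],
                [0.2, 0.1, 0.0]])
emit, recv = emitted_received(rel)
# emit[0] = (0.9 + 0.8) / 2 = 0.85, the largest emitted average
```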

Figure 6: Learned relation weights for additional parts and object categories. Each number denotes the weight emitted from or received by the given part, averaged over all other parts. Orange marks the top three (or top one) relation weights in each column.

C. Training details

Our framework is implemented in PyTorch. The network is trained for around 200 epochs until convergence. The initial learning rate is set to 0.001, and we employ Adam to optimize the whole framework. We supervise the output poses at all time steps to accelerate convergence. The graph neural network runs for five iterations to produce the final output pose; experimentally, we find that performance saturates at five iterations, and we observe no obvious improvement with more.
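The per-step supervision described above can be sketched as a deep-supervision loss: the pose predicted at every one of the five iterations contributes to the training objective, not just the final one. A minimal illustration with a squared-error placeholder standing in for the paper's actual pose losses (all names are hypothetical):

```python
import numpy as np

NUM_ITERS = 5   # graph network iterations per forward pass

def pose_loss(pred, gt):
    """Placeholder per-step loss; mean squared error stands in for the
    paper's pose / chamfer supervision."""
    return float(((pred - gt) ** 2).mean())

def deep_supervision_loss(step_preds, gt):
    """Total training loss: sum of the per-step losses over all
    iterations, so every intermediate pose estimate is supervised."""
    return sum(pose_loss(p, gt) for p in step_preds)

rng = np.random.default_rng(0)
gt = rng.normal(size=(4, 7))     # 4 parts, 7-D pose (quaternion + translation)
# Simulated coarse-to-fine refinement: noise shrinks at each iteration.
step_preds = [gt + rng.normal(size=gt.shape) * (0.5 ** t)
              for t in range(NUM_ITERS)]
loss = deep_supervision_loss(step_preds, gt)
```

With this objective, gradients flow into the early, coarse iterations directly, which is what accelerates convergence of the iterative GNN.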

To compute geometrically equivalent parts, we first filter out part pairs whose axis-aligned bounding-box dimensions differ by more than a threshold of 0.1; among the remaining pairs, we treat two parts as equivalent when their chamfer distance falls below an empirical threshold of 0.2.
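This two-stage test can be sketched as follows. A simplified NumPy version using the stated thresholds of 0.1 (AABB dimensions) and 0.2 (chamfer distance); the exact point-cloud normalization is assumed, not taken from the paper:

```python
import numpy as np

def aabb_dims(pc):
    """Axis-aligned bounding-box dimensions of a point cloud (N, 3)."""
    return pc.max(axis=0) - pc.min(axis=0)

def chamfer(a, b):
    """Symmetric chamfer distance between two point clouds."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def geometrically_equivalent(a, b, dim_thresh=0.1, cd_thresh=0.2):
    """Stage 1: reject pairs whose AABB dimensions differ by more than
    dim_thresh. Stage 2: accept the pair if its chamfer distance is
    below cd_thresh."""
    if np.abs(aabb_dims(a) - aabb_dims(b)).max() > dim_thresh:
        return False
    return bool(chamfer(a, b) < cd_thresh)

rng = np.random.default_rng(0)
leg = rng.uniform(0, 1, size=(64, 3)) * [0.1, 0.1, 0.5]   # thin, tall part
leg_copy = leg + rng.normal(scale=1e-3, size=leg.shape)   # near-identical part
seat = rng.uniform(0, 1, size=(64, 3)) * [0.5, 0.5, 0.1]  # flat, wide part
```

Here the two leg-like clouds pass both stages, while the leg/seat pair is rejected already by the cheap AABB check, which is the point of filtering before computing the more expensive chamfer distance.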

D. Failure cases and future work

In Figure 7, we show a few cases where our algorithm fails to assemble a well-connected shape and thus produces floating parts. For example, the legs and arms are disconnected from, or misaligned with, the chair base and back. This indicates that our algorithm lacks physical constraints to enforce connections among the parts: the learned dynamic part graph builds soft relations between the central and peripheral parts for a progressive part assembly procedure, but it imposes no hard connection constraints. In future work, we plan to address this by developing a joint-centric assembly framework that focuses on the relative displacement and rotation between parts, complementing the current part-centric algorithm.

Figure 7: Failure cases where some parts are floating and disconnected from the other parts (columns alternate between ground truth and our result).

E. Additional results of structural variation

Many previous works Gao et al. (2019); Wu et al. (2019); Wang et al. (2018); Mo et al. (2020b) learn to create novel shapes from scratch by embedding both geometric and structural diversity in generative networks. In our problem, however, the part geometry is given, so only structural diversity can be modeled. This poses a greater challenge to the generative model, since some shapes can be assembled in different ways while others admit only a single, deterministic assembly.

We demonstrate additional diverse assembled shapes produced by our algorithm in Figure 8. The top five examples exhibit structural variation in the results, while in the bottom three examples our algorithm predicts the same assembly each time because the limited input part set admits only one structure.

F. Additional qualitative results

We demonstrate additional visual results predicted by our dynamic graph learning algorithm in Figures 9, 10 and 11.

Figure 8: Additional diverse results generated by our network (columns: Assembly 1, Assembly 2, Assembly 3, Ground truth). Top: structural variation demonstrated in the part assembly; Bottom: no structural variation for cases with very limited parts.
Figure 9: Additional qualitative comparison between our algorithm and the baseline methods on the PartNet Chair dataset (columns: B-Complement, B-Global, B-LSTM, Ours, Ground truth).
Figure 10: Additional qualitative comparison between our algorithm and the baseline methods on the PartNet Table dataset (columns: B-Complement, B-Global, B-LSTM, Ours, Ground truth).
Figure 11: Additional qualitative comparison between our algorithm and the baseline methods on the PartNet Lamp dataset (columns: B-Complement, B-Global, B-LSTM, Ours, Ground truth).

References
  • P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. J. Guibas (2018) Learning representations and generative models for 3d point clouds. Cited by: §4.3.
  • S. Chaudhuri, E. Kalogerakis, L. Guibas, and V. Koltun (2011) Probabilistic reasoning for assembly-based 3d modeling. In ACM SIGGRAPH 2011 papers, pp. 1–10. Cited by: §2.
  • S. Chaudhuri and V. Koltun (2010) Data-driven suggestions for creativity support in 3d modeling. In ACM SIGGRAPH Asia 2010 papers, pp. 1–10. Cited by: §1, §2.
  • V. S. Chen, P. Varma, R. Krishna, M. Bernstein, C. Re, and L. Fei-Fei (2019) Scene graph prediction with limited labels. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2580–2590. Cited by: §3.2.
  • A. Dubrovina, F. Xia, P. Achlioptas, M. Shalah, R. Groscot, and L. J. Guibas (2019) Composite shape modeling via latent space factorization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8140–8149. Cited by: §2.
  • H. Fan, H. Su, and L. J. Guibas (2017) A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 605–613. Cited by: §3.4.
  • T. Funkhouser, M. Kazhdan, P. Shilane, P. Min, W. Kiefer, A. Tal, S. Rusinkiewicz, and D. Dobkin (2004) Modeling by example. ACM transactions on graphics (TOG) 23 (3), pp. 652–663. Cited by: §2.
  • L. Gao, J. Yang, T. Wu, Y. Yuan, H. Fu, Y. Lai, and H. Zhang (2019) SDM-net: deep generative network for structured deformable mesh. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–15. Cited by: §2, §3.2, §6.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
  • P. Jaiswal, J. Huang, and R. Rai (2016) Assembly-based conceptual 3d modeling with unlabeled components using probabilistic factor graph. Computer-Aided Design 74, pp. 45–54. Cited by: §2.
  • E. Kalogerakis, S. Chaudhuri, D. Koller, and V. Koltun (2012) A probabilistic model for component-based shape synthesis. ACM Transactions on Graphics (TOG) 31 (4), pp. 1–11. Cited by: §2.
  • D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In Proc. Int. Conf. on Learning Representations.. Cited by: §2.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §3.2.
  • J. Li, C. Niu, and K. Xu (2020a) Learning part generation and assembly for structure-aware shape synthesis. AAAI. Cited by: §1, §2, §4.2.
  • J. Li, K. Xu, S. Chaudhuri, E. Yumer, H. Zhang, and L. Guibas (2017a) Grass: generative recursive autoencoders for shape structures. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–14. Cited by: §2.
  • M. Li, A. G. Patil, K. Xu, S. Chaudhuri, O. Khan, A. Shamir, C. Tu, B. Chen, D. Cohen-Or, and H. Zhang (2019) Grains: generative recursive autoencoders for indoor scenes. ACM Transactions on Graphics (TOG) 38 (2), pp. 1–16. Cited by: §3.2.
  • Y. Li, K. Mo, L. Shao, M. Sung, and L. Guibas (2020b) Learning 3d part assembly from a single image. arXiv preprint arXiv:2003.09754. Cited by: §2, §4.3.
  • Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang (2017b) Scene graph generation from objects, phrases and region captions. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1261–1270. Cited by: §3.2.
  • Y. Litvak, A. Biess, and A. Bar-Hillel (2019) Learning pose estimation for high-precision robotic assembly using simulated depth images. In International Conference on Robotics and Automation (ICRA), pp. 3521–3527. Cited by: §2.
  • J. Luo, E. Solowjow, C. Wen, J. A. Ojea, A. M. Agogino, A. Tamar, and P. Abbeel (2019) Reinforcement learning on variable impedance controller for high-precision robotic assembly. In 2019 International Conference on Robotics and Automation (ICRA), pp. 3080–3087. Cited by: §2.
  • K. Mo, P. Guerrero, L. Yi, H. Su, P. Wonka, N. Mitra, and L. J. Guibas (2019a) Structurenet: hierarchical graph networks for 3d shape generation. ACM Transactions on Graphics (TOG). Cited by: §2, §3.2.
  • K. Mo, P. Guerrero, L. Yi, H. Su, P. Wonka, N. Mitra, and L. J. Guibas (2020a) StructEdit: learning structural shape variations. Cited by: §2.
  • K. Mo, H. Wang, X. Yan, and L. J. Guibas (2020b) PT2PC: learning to generate 3d point cloud shapes from part tree conditions. arXiv preprint arXiv:2003.08624. Cited by: §2, §6.
  • K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su (2019b) Partnet: a large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 909–918. Cited by: §1, §2, §4.1.
  • C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660. Cited by: §3.1.
  • D. Ritchie, K. Wang, and Y. Lin (2019) Fast and flexible indoor scene synthesis via deep convolutional generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6182–6190. Cited by: §3.2.
  • N. Schor, O. Katzir, H. Zhang, and D. Cohen-Or (2019) CompoNet: learning to generate the unseen by part synthesis and composition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8759–8768. Cited by: §1, §2, §4.2.
  • L. Shao, T. Migimatsu, and J. Bohg (2019) Learning to scaffold the development of robotic manipulation skills. Cited by: §2.
  • C. Shen, H. Fu, K. Chen, and S. Hu (2012) Structure recovery by part assembly. ACM Transactions on Graphics (TOG) 31 (6), pp. 1–11. Cited by: §1, §2.
  • M. Sung, H. Su, V. G. Kim, S. Chaudhuri, and L. Guibas (2017) Complementme: weakly-supervised component suggestions for 3d modeling. ACM Transactions on Graphics (TOG) 36 (6), pp. 1–12. Cited by: §1, §2, §4.2.
  • H. Wang, N. Schor, R. Hu, H. Huang, D. Cohen-Or, and H. Huang (2018) Global-to-local generative model for 3d shapes. ACM Transactions on Graphics (TOG) 37 (6), pp. 1–10. Cited by: §2, §6.
  • K. Wang, Y. Lin, B. Weissmann, M. Savva, A. X. Chang, and D. Ritchie (2019a) Planit: planning and instantiating indoor scenes with relation graph and spatial prior networks. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–15. Cited by: §3.2.
  • Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019b) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (TOG) 38 (5), pp. 1–12. Cited by: §3.2.
  • R. Wu, Y. Zhuang, K. Xu, H. Zhang, and B. Chen (2020) PQ-net: a generative part seq2seq network for 3d shapes. CVPR. Cited by: §1, §2, §4.2.
  • Z. Wu, X. Wang, D. Lin, D. Lischinski, D. Cohen-Or, and H. Huang (2019) Sagnet: structure-aware generative network for 3d-shape modeling. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–14. Cited by: §2, §6.
  • D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei (2017) Scene graph generation by iterative message passing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5419. Cited by: §3.2.
  • J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh (2018) Graph r-cnn for scene graph generation. In Proceedings of the European conference on computer vision (ECCV), pp. 670–685. Cited by: §3.2.
  • L. Yi, V. G. Kim, D. Ceylan, I. Shen, M. Yan, H. Su, C. Lu, Q. Huang, A. Sheffer, and L. Guibas (2016) A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG) 35 (6), pp. 1–12. Cited by: §1, §2.
  • K. Zakka, A. Zeng, J. Lee, and S. Song (2019) Form2Fit: learning shape priors for generalizable assembly from disassembly. Cited by: §2.
  • L. Zhang, D. Xu, A. Arnab, and P. H. Torr (2020) Dynamic graph message passing networks. Cited by: §3.2.
  • Y. Zhou, Z. While, and E. Kalogerakis (2019) SceneGraphNet: neural message passing for 3d indoor scene augmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7384–7392. Cited by: §3.2.