Generative 3D Part Assembly via Dynamic Graph Learning, NeurIPS 2020
Autonomous part assembly is a challenging yet crucial task in 3D computer vision and robotics. Analogous to buying a piece of IKEA furniture, given a set of 3D parts that can assemble a single shape, an intelligent agent needs to perceive the 3D part geometry, reason to propose pose estimations for the input parts, and finally call robotic planning and control routines for actuation. In this paper, we focus on the pose estimation subproblem from the vision side, which involves geometric and relational reasoning over the input part geometry. Essentially, the task of generative 3D part assembly is to predict a 6-DoF part pose, including a rigid rotation and translation, for each input part so that the posed parts assemble into a single 3D shape as the final output. To tackle this problem, we propose an assembly-oriented dynamic graph learning framework that leverages an iterative graph neural network as a backbone. It explicitly conducts sequential part assembly refinements in a coarse-to-fine manner and exploits a part relation reasoning module together with a part aggregation module to dynamically adjust both part features and their relations in the part graph. We conduct extensive experiments and quantitative comparisons to three strong baseline methods, demonstrating the effectiveness of the proposed approach.
It is a complicated and laborious task, even for humans, to assemble a piece of IKEA furniture from its parts. Without referring to any procedural or external guidance, e.g. reading the instruction manual or watching a step-by-step video demonstration, the task of 3D part assembly involves exploring an extremely large solution space and reasoning over the input part geometry for candidate assembly proposals. To assemble a physically stable piece of furniture, a rich set of part relations and constraints needs to be satisfied.
There is some literature in the computer vision and graphics fields that studies part-based 3D shape modeling and synthesis. For example, Chaudhuri and Koltun (2010); Shen et al. (2012); Sung et al. (2017) employ a third-party repository of 3D meshes for part retrieval to assemble a complete shape. Benefiting from recent large-scale part-level datasets Mo et al. (2019b); Yi et al. (2016) and the advent of deep learning techniques, some recent works Li et al. (2020a); Schor et al. (2019); Wu et al. (2020) leverage deep neural networks to sequentially generate part geometry and posing transforms for shape composition. Though similar to our task, none of these works addresses exactly the same setting as ours. They either allow free-form generation of the part geometry, or assume certain part priors, such as a known number of parts, known part semantics, or an available large part pool. In our setting, we assume no semantic knowledge of the input parts and assemble 3D shapes conditioned on a given set of fine-grained part geometries, with a variable number of parts for different shape instances.
In this paper, we propose a dynamic graph learning framework that predicts a 6-DoF part pose, including a rigid rotation and translation, for each input part point cloud by forming a dynamically varying part graph and iteratively reasoning over the part poses and their relations. We employ an iterative graph neural network to gradually refine the part pose estimations in a coarse-to-fine manner. At the core of our method, we propose the dynamic part relation reasoning module and the dynamic part aggregation module, which jointly learn to dynamically evolve part node and edge features within the part graph.
Lacking real-world data for 3D part assembly, we train and evaluate the proposed approach on the synthetic PartNet dataset, which provides a large-scale benchmark with ground-truth part assembly for ShapeNet models at the fine-grained part granularity. Although no previous work studies exactly the same problem setting as ours, we formulate three strong baselines inspired by previously published works on similar task domains and demonstrate that our method outperforms the baseline methods by significant margins.
Diagnostic analysis further indicates that in the iterative part assembly procedure, a set of central parts (e.g. chair back, chair seat) learns much faster than the other peripheral parts (e.g. chair legs, chair arms), which quickly sketches out the shape backbone in the first several iterations. Then, the peripheral parts gradually adjust their poses to match the central part poses via the graph message-passing mechanism. Such dynamic behaviors emerge automatically without direct supervision and thus demonstrate the effectiveness of our dynamic graph learning framework.
Prior assembly work in the robotics community emphasizes planning, in-hand manipulation and robot grasping from a partial RGB-D observation in an active learning manner, while our work is closer to work in vision and graphics, which focuses on the problem of pose or joint estimation for part assembly. On this side, Funkhouser et al. (2004) is the pioneering work that constructs 3D geometric surface models by assembling parts of interest from a repository of 3D meshes. The follow-ups Chaudhuri et al. (2011); Kalogerakis et al. (2012); Jaiswal et al. (2016) learn a probabilistic graphical model that encodes semantic and geometric relationships among shape components to explore part-based shape modeling. Chaudhuri and Koltun (2010); Shen et al. (2012); Sung et al. (2017) model the 3D shape by assembly, conditioned on a single-view scan input, rough models created by artists, or a partial shape.
However, most of these previous works rely on a third-party shape repository to query a part for the assembly. Inspired by recent generative deep learning techniques and benefiting from the large-scale annotated object part datasets Mo et al. (2019b); Yi et al. (2016), some recent works Li et al. (2020a); Schor et al. (2019); Wu et al. (2020) generate the parts and then predict the per-part transformation to compose the shape. Dubrovina et al. (2019) introduces a Decomposer-Composer network for a novel factorized shape latent space. These existing data-driven approaches mostly focus on creating a novel shape from the accumulated shape prior, and restrict the estimated transformation parameters to a 6-DoF part pose of translation and scale. They assume object parts are already rotated to stand in the object canonical space. In this work, we focus on a more practical problem setting, similar to assembling the parts of a piece of IKEA furniture, where all the parts are provided and laid out on the ground in the part canonical space. Our goal of part assembly is to estimate the part-wise 6-DoF pose of rotation and translation to compose the parts into a complete shape (furniture). A recent work Li et al. (2020b) has a similar setting but requires an input image as guidance.
Structure-aware generative networks. Deep generative models, such as generative adversarial networks (GAN) Goodfellow et al. (2014) and variational autoencoders (VAE) Kingma and Welling (2014), have been explored recently for shape generation tasks. Li et al. (2017a); Mo et al. (2019a) propose hierarchical generative networks to encode structured models, represented as abstracted bounding boxes. The follow-up work Mo et al. (2020a) extends the learned structural variations into conditional shape editing. Gao et al. (2019) introduces a two-level VAE to jointly learn the global shape structure and fine part geometries. Wu et al. (2019) proposes a two-branch generative network to exchange information between structure and geometry for 3D shape modeling. Wang et al. (2018) presents a global-to-local adversarial network to construct the overall structure of the shape, followed by a conditional autoencoder for part refinement. Recently, Mo et al. (2020b) employs a conditional GAN to generate a point cloud from an input rough shape structure. Most of the aforementioned works couple shape structure and geometry in joint learning for diverse and perceptually plausible 3D modeling. In contrast, we focus on a more challenging problem that aims at generating shapes with only structural variations, conditioned on fixed detailed part geometry.
Given a set of 3D part point clouds $\mathcal{P} = \{p_i\}_{i=1}^{N}$ as inputs, where $N$ denotes the number of parts, which may vary for different shapes, the goal of our task is to predict a 6-DoF part pose $q_i$ for each input part $p_i$ and form a final part assembly for a complete 3D shape $S = \cup_i\, q_i(p_i)$, where $q_i(p_i)$ denotes the part point cloud $p_i$ transformed according to $q_i$. The input parts may come in many geometrically-equivalent groups, e.g. four chair legs or two chair arms, where the parts in each group share the same part geometry, and we assume the part count in each group is known.
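To make the task formulation concrete, the following minimal sketch applies a per-part pose to a part point cloud and collects the posed parts into one shape. It assumes a pose is stored as a quaternion in (w, x, y, z) order plus a translation vector; the component order and the function names `quat_rotate`, `apply_pose` and `assemble` are illustrative assumptions, not taken from the paper.

```python
import math

def quat_rotate(q, p):
    """Rotate a 3D point p by a unit quaternion q = (w, x, y, z) (assumed order)."""
    w, x, y, z = q
    px, py, pz = p
    # Rows of the rotation matrix that corresponds to the unit quaternion
    return (
        (1 - 2 * (y * y + z * z)) * px + 2 * (x * y - w * z) * py + 2 * (x * z + w * y) * pz,
        2 * (x * y + w * z) * px + (1 - 2 * (x * x + z * z)) * py + 2 * (y * z - w * x) * pz,
        2 * (x * z - w * y) * px + 2 * (y * z + w * x) * py + (1 - 2 * (x * x + y * y)) * pz,
    )

def apply_pose(pose, points):
    """Apply pose = (quaternion, translation) to every point of one part."""
    q, t = pose
    norm = math.sqrt(sum(c * c for c in q))
    q = tuple(c / norm for c in q)  # renormalise so q is a valid rotation
    return [tuple(r + o for r, o in zip(quat_rotate(q, p), t)) for p in points]

def assemble(parts, poses):
    """Union of all posed part point clouds, i.e. the assembled shape."""
    shape = []
    for points, pose in zip(parts, poses):
        shape.extend(apply_pose(pose, points))
    return shape
```

For instance, a quarter turn about the z-axis followed by a unit shift along z maps the point (1, 0, 0) to approximately (0, 1, 1).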
To tackle this problem, we propose an assembly-oriented dynamic graph learning framework that leverages an iterative graph neural network (GNN) as a backbone, which explicitly conducts sequential part assembly refinements in a coarse-to-fine manner, and exploits a part relation reasoning module together with a part aggregation module for iteratively adjusting part features and their relations in the part graph. Figure 1 illustrates our proposed pipeline. Below, we first introduce the iterative GNN backbone and then discuss the dynamic part relation reasoning module and part aggregation module in detail.
We represent the dynamic part graph at every time step $t$ as a self-looped directed graph $\mathcal{G}^{(t)} = (\mathcal{V}^{(t)}, \mathcal{E}^{(t)})$, where $\mathcal{V}^{(t)} = \{v_i^{(t)}\}_{i=1}^{N}$ is the set of nodes and $\mathcal{E}^{(t)} = \{e_{ij}^{(t)}\}$ is the set of edges in $\mathcal{G}^{(t)}$. We treat each part as a node in the graph and initialize its attribute $v_i^{(0)}$ via encoding the part geometry as $v_i^{(0)} = f_{init}(p_i)$, where $f_{init}$ is a parametric function implemented as a vanilla PointNet Qi et al. (2017) that extracts a global permutation-invariant feature summarizing the input part point cloud $p_i$. We use a shared PointNet to process all the parts.
We use a fully connected graph, drawing an edge between every pair of parts, and perform the iterative graph message-passing operations by alternating between updating the edge and node features. To be specific, the edge attribute $e_{ij}^{(t)}$ emitted from node $v_j^{(t)}$ to $v_i^{(t)}$ at time step $t$ is calculated as a neural message
$$e_{ij}^{(t)} = f_{edge}^{(t)}\big(v_i^{(t)}, v_j^{(t)}\big), \tag{1}$$
which is then leveraged to update the node attribute $v_i^{(t+1)}$ at the next time step by aggregating messages from all the other nodes
$$v_i^{(t+1)} = f_{node}^{(t)}\Big(v_i^{(t)}, \frac{1}{N}\sum_{j=1}^{N} e_{ij}^{(t)}\Big), \tag{2}$$
that takes both the previous node attribute and the averaged message among neighbors as inputs. The part pose $q_i^{(t+1)}$, including a 3-DoF rotation represented as a unit 4-dimensional quaternion vector and a 3-dimensional translation vector denoting the part center offset, is then regressed by decoding the updated node attribute via
$$q_i^{(t+1)} = f_{pose}^{(t)}\big(v_i^{(t+1)}, v_i^{(0)}, q_i^{(t)}\big). \tag{3}$$
Besides the node feature $v_i^{(t+1)}$ at the current time step, $f_{pose}^{(t)}$ also takes as input the initial node attribute $v_i^{(0)}$ to capture the raw part geometry information, and the pose $q_i^{(t)}$ estimated in the last time step for more coherent pose evolution. Note that $q_i^{(0)}$ is not defined and hence not fed to $f_{pose}^{(0)}$ at the first time step.
In our implementation, $f_{edge}^{(t)}$, $f_{node}^{(t)}$ and $f_{pose}^{(t)}$ are all parameterized as Multi-Layer Perceptrons (MLP) that are shared across all the edges or nodes within each time step. Note that we use different network weights for different iterations, since the node and edge features evolve over time and may contain information at different scales. Our iterative graph neural network runs for 5 iterations and learns to sequentially refine the part assembly in a coarse-to-fine manner.
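The alternation between edge and node updates can be sketched in a few lines. This is an illustrative sketch under simplifying assumptions, not the authors' implementation: node features are plain Python lists, and `toy_edge` and `toy_node` are hypothetical hand-written stand-ins for the learned MLPs. Passing one function pair per iteration mirrors the use of unshared weights across iterations.

```python
def message_passing_step(nodes, f_edge, f_node):
    """One round of edge and node updates on a fully connected, self-looped graph."""
    n = len(nodes)
    updated = []
    for i in range(n):
        # Edge update: message emitted from every node j to node i
        msgs = [f_edge(nodes[i], nodes[j]) for j in range(n)]
        # Average the incoming messages over all nodes (self-loop included)
        mean_msg = [sum(m[d] for m in msgs) / n for d in range(len(msgs[0]))]
        # Node update: combine the previous feature with the averaged message
        updated.append(f_node(nodes[i], mean_msg))
    return updated

def run_backbone(nodes, step_fns):
    """step_fns holds one (f_edge, f_node) pair per iteration (unshared weights)."""
    for f_edge, f_node in step_fns:
        nodes = message_passing_step(nodes, f_edge, f_node)
    return nodes

# Toy stand-ins for the learned MLPs (hypothetical, for illustration only)
def toy_edge(v_i, v_j):
    return [b - a for a, b in zip(v_i, v_j)]

def toy_node(v_i, msg):
    return [a + 0.5 * m for a, m in zip(v_i, msg)]
```

With these toy functions, each round pulls the node features toward one another, which gives a rough intuition for how repeated message passing lets parts agree on a shared configuration.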
The relationship between entities is known to be important for a better understanding of visual data. Various relation graphs have been defined in the literature. Xu et al. (2017); Li et al. (2017b); Yang et al. (2018); Chen et al. (2019) learn the scene graph from labeled object relationships in a large-scale image dataset, Visual Genome Krishna et al. (2017), in support of the 2D object detection task. Li et al. (2019); Zhou et al. (2019); Wang et al. (2019a); Ritchie et al. (2019) calculate the statistical relationships between objects via geometric heuristics for the 3D scene generation problem. In terms of shape understanding, Mo et al. (2019a); Gao et al. (2019) define the relation as the adjacency or symmetry between every two parts in the full shape.
In our work, we learn dynamically evolving part relationships for the task of part assembly. Different from many previous dynamic graph learning algorithms Wang et al. (2019b); Zhang et al. (2020) that only evolve the node and edge features implicitly, we propose to explicitly update the relation graph based on the estimated part assembly at each time step. This design is specific to the part assembly task and builds the assembly process directly into the model. At each time step, we predict the part pose for each part, and the resulting temporary part assembly enables explicit reasoning about how to refine part poses in the next step, considering the current part disconnections and the overall shape geometry.
To achieve this goal, besides the maintained edge attributes, we also learn to reason a directed edge-wise weight scalar $r_{ij}^{(t)}$ to indicate the significance of the relation from node $v_j^{(t)}$ to $v_i^{(t)}$. Then, we update the node attribute at time step $t$ via multiplying the weight scalar and edge attribute
$$v_i^{(t+1)} = f_{node}^{(t)}\Big(v_i^{(t)}, \frac{\sum_{j} e_{ij}^{(t)}\, r_{ij}^{(t)}}{\sum_{j} r_{ij}^{(t)}}\Big). \tag{4}$$
There are various options to implement $r_{ij}^{(t)}$. For example, one can employ the part geometry transformed with the estimated poses to regress the relation, or incorporate the holistic assembled shape feature. In our implementation, however, we find that directly exploiting the pose features to learn the relation is already satisfactory. This is mainly because parts of different semantics may share similar geometries but usually have different poses. For instance, the leg and leg stretcher are geometrically similar but placed and oriented very differently. Therefore, we adopt the simple solution of reasoning the relation only from the estimated poses
$$r_{ij}^{(t)} = f_{relation}\big(f_{feat}(q_i^{(t)}), f_{feat}(q_j^{(t)})\big), \tag{5}$$
where both $f_{relation}$ and $f_{feat}$ are parameterized as MLPs. $f_{feat}$ is used to extract the independent pose feature from each part pose prediction. Note that we set $r_{ij}^{(0)} = 1$ in the beginning, when no pose estimate is available yet.
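A small sketch may help to see how the relation weights enter the node update. It assumes that the weighted messages are normalised by the sum of the weights, so that all-one weights fall back to a plain average; the extracted text does not preserve the exact normalisation, so treat it as an assumption. `f_feat` and `f_relation` are passed in as hypothetical callables standing in for the learned MLPs.

```python
def relation_weights(poses, f_feat, f_relation):
    """weights[i][j]: significance of the relation from part j to part i, from poses only."""
    feats = [f_feat(q) for q in poses]  # independent pose feature per part
    n = len(poses)
    return [[f_relation(feats[i], feats[j]) for j in range(n)] for i in range(n)]

def weighted_message(msgs, weights):
    """Relation-weighted average of the messages arriving at one node.

    Normalising by the weight sum is an assumption of this sketch.
    """
    total = sum(weights)
    dim = len(msgs[0])
    return [sum(w * m[d] for w, m in zip(weights, msgs)) / total for d in range(dim)]
```

With uniform weights the result equals the unweighted mean, matching the initial step in which all relation weights are set to one.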
We observe that geometrically-equivalent parts are highly correlated and thus very likely to share common knowledge regarding part poses and relationships. For example, four long sticks may all serve as legs that stand upright on the ground, or two sticks may serve as leg stretchers that lie parallel to each other. Thus, we propose a dynamic part aggregation module that allows more direct information exchange among geometrically-equivalent parts.
To achieve this goal, we explicitly create two sets of nodes at different assembly levels: a dense node set including all the part nodes, and a sparse node set created by aggregating all the geometrically-equivalent nodes into a single node. Then, we perform the graph learning by alternately updating the relation graph over the dense and sparse node sets. In this manner, we allow dynamic communication among geometrically-equivalent parts for synchronizing shared information while learning to diverge to different part poses.
In our implementation, we denote $\mathcal{G}^{(t)}$ as the dense node graph at time step $t$. To create a sparse node graph $\hat{\mathcal{G}}^{(t)}$ at time step $t$, we firstly aggregate the node attributes among the geometrically-equivalent parts $\mathcal{V}_g$ via max-pooling into a single node $\hat{v}_j^{(t)}$ as
$$\hat{v}_j^{(t)} = \max_{k \in \mathcal{V}_g} v_k^{(t)}. \tag{6}$$
Then, we aggregate the relation weights $r_{ik}^{(t)}$, $k \in \mathcal{V}_g$, from the geometrically-equivalent parts to any other node $v_i^{(t)}$ into a single weight $\hat{r}_{ij}^{(t)}$ (Eq. (7)). The inverse relation $\hat{r}_{ji}^{(t)}$, emitted from the node $v_i^{(t)}$ to the aggregated node $\hat{v}_j^{(t)}$, is computed similarly (Eq. (8)).
These aggregation steps are carried out once the update of the dense node graph $\mathcal{G}^{(t)}$ is finished, after which we operate on the sparse node graph $\hat{\mathcal{G}}^{(t)}$ following Eq. (4) and (5). To expand the sparse node set back to the dense node set, we simply unpool each aggregated node feature to its corresponding set of geometrically-equivalent parts. We alternately conduct dynamic graph learning over the dense and sparse node sets at odd and even iterations, respectively.
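The pooling and unpooling between the dense and sparse node sets can be sketched as follows. Max-pooling of node features follows the text; averaging the relation weights over group members is an assumption of this sketch, since the exact aggregation operator is not recoverable here. `groups` is a hypothetical list of index lists, one per geometrically-equivalent group, with singleton groups for unique parts.

```python
def to_sparse(nodes, groups):
    """Max-pool the features of each equivalence group into one node."""
    dim = len(nodes[0])
    return [[max(nodes[k][d] for k in g) for d in range(dim)] for g in groups]

def to_dense(sparse_nodes, groups, n):
    """Unpool: copy each aggregated feature back to every part of its group."""
    dense = [None] * n
    for feat, g in zip(sparse_nodes, groups):
        for k in g:
            dense[k] = list(feat)
    return dense

def aggregate_relations(r, groups):
    """Group-level relation weights; r[i][j] is the weight from part j to part i.

    Mean aggregation is an assumption made for illustration.
    """
    return [[sum(r[i][k] for i in gi for k in gj) / (len(gi) * len(gj))
             for gj in groups] for gi in groups]
```

For three parts where the first two are equivalent, `groups = [[0, 1], [2]]` yields a sparse graph with two nodes, and unpooling restores three nodes in which the two equivalent parts share one feature.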
Given an input set of part point clouds, there may be multiple solutions for the shape assembly. For example, one can move a leg stretcher up and down as long as it is connected to two parallel legs. The chair back can also be laid down to form a deck chair. To address the multi-modal predictions, we employ the Min-of-N (MoN) loss Fan et al. (2017) to balance between the assembly quality and diversity. Let $f$ denote our whole framework, which takes in the part point cloud set $\mathcal{P}$ and a random noise $z_j$ sampled from the unit Gaussian distribution $\mathcal{N}(0, I)$. Let $\mathcal{L}$ be any loss function supervising the network outputs $f(\mathcal{P}, z_j)$ and $S^*$ be one provided ground truth sample in the dataset, then the MoN loss is defined as
$$\mathcal{L}_{mon} = \min_{z_j \sim \mathcal{N}(0, I)} \mathcal{L}\big(f(\mathcal{P}, z_j), S^*\big). \tag{9}$$
The MoN loss encourages at least one of the predictions to be close to the ground truth data, which is more tolerant of a dataset with limited diversity and hence more suitable for our problem. In practice, we sample 5 particles of $z_j$ to approximate Eq. (9).
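The Min-of-N idea reduces to taking the smallest loss over several noise samples. The sketch below is illustrative only: `model` and `loss_fn` are hypothetical callables, and the noise is drawn as a single scalar for brevity, whereas the actual noise dimensionality is not stated in this section.

```python
import random

def min_of_n_loss(model, parts, target, loss_fn, n_samples=5, rng=None):
    """Approximate the Min-of-N loss with n_samples noise draws."""
    if rng is None:
        rng = random.Random(0)
    losses = []
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)  # scalar unit-Gaussian noise (simplification)
        losses.append(loss_fn(model(parts, z), target))
    # Only the best of the sampled predictions is penalised
    return min(losses)
```

Because only the best sample is penalised, the remaining samples are free to cover other plausible assemblies of the same part set.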
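The loss $\mathcal{L}$ is implemented as a weighted combination of both local part and global shape losses, detailed below. Each part pose $q_i$ can be decomposed into a rotation $r_i$ and a translation $t_i$. We supervise the translation via an $\ell_2$ loss,
$$\mathcal{L}_t = \sum_{i=1}^{N} \big\| t_i - t_i^* \big\|_2^2. \tag{10}$$
The rotation is supervised via the Chamfer distance $d_c$ on the rotated part point cloud,
$$\mathcal{L}_r = \sum_{i=1}^{N} d_c\big(r_i(p_i), r_i^*(p_i)\big). \tag{11}$$
In order to achieve good assembly quality holistically, we also supervise the full shape assembly $S$ using the Chamfer distance (CD),
$$\mathcal{L}_s = d_c\big(S, S^*\big). \tag{12}$$
In all equations above, the asterisk symbols denote the corresponding ground-truth values.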
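For reference, a brute-force Chamfer distance and the translation term can be written as below. This sketch assumes the common symmetric form that sums squared nearest-neighbour distances in both directions; whether sums or means are used, and the loss weights, are not given in this section and are left out.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def chamfer(xs, ys):
    """Symmetric Chamfer distance (sum of squared nearest-neighbour distances)."""
    forward = sum(min(sq_dist(x, y) for y in ys) for x in xs)
    backward = sum(min(sq_dist(x, y) for x in xs) for y in ys)
    return forward + backward

def translation_loss(pred_t, gt_t):
    """Sum of squared differences between predicted and ground-truth translations."""
    return sum(sq_dist(p, g) for p, g in zip(pred_t, gt_t))
```

Measuring rotation error through the Chamfer distance on rotated point clouds, rather than directly on quaternions, means that a symmetric part placed in a different but visually identical orientation is not penalised.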
We conduct extensive experiments demonstrating the effectiveness of the proposed method and show quantitative and qualitative comparisons to three baseline methods. We also provide diagnostic analysis over the learned part relation dynamics, which clearly illustrates the iterative coarse-to-fine refinement procedure.
We leverage the recent PartNet Mo et al. (2019b), a large-scale shape dataset with fine-grained and hierarchical part segmentations, for both training and evaluation. We use the three largest categories, chairs, tables and lamps, and adopt its default train/validation/test splits in the dataset. In total, there are 6,323 chairs, 8,218 tables and 2,207 lamps. We deal with the most fine-grained level of PartNet segmentation. We use Furthest Point Sampling (FPS) to sample 1,000 points for each part point cloud. All parts are zero-centered and provided in the canonical part space computed using PCA.
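Furthest Point Sampling greedily picks the point that is farthest from everything selected so far. The following pure-Python sketch is illustrative; the starting index and the function name are assumptions, and a practical pipeline would use a vectorised implementation.

```python
def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to be mutually far apart."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    k = min(k, len(points))  # cannot sample more distinct points than exist
    chosen = [start]
    # Distance from every point to its nearest already-chosen point
    dist = [sq_dist(p, points[start]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(d, sq_dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen
```

Compared with uniform random sampling, this keeps thin structures such as chair legs covered even when they contain few points.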
Since our task is novel, there is no direct baseline method to compare against. Instead, we compare to three baseline methods inspired by previous works that share a similar spirit of part-based shape modeling or synthesis.
B-Complement: ComplementMe Sung et al. (2017) studies the task of synthesizing 3D shapes from a big repository of parts and mostly focuses on retrieving part candidates from the part database. We adapt the setting to our case by limiting the part repository to the input part set and sequentially predicting a part pose for each part.
B-LSTM: Instead of leveraging a graph structure to encode and decode part information jointly, we use a bidirectional LSTM module similar to PQ-Net Wu et al. (2020) to sequentially estimate the part pose. Note that the original PQ-Net studies the task of part-aware shape generative modeling, which is a quite different task from ours.
B-Global: Without using the iterative GNN, we directly use the per-part feature, augmented with the global shape descriptor, to regress the part pose in one shot. Though dealing with different tasks, this baseline method borrows a network design similar to CompoNet Schor et al. (2019) and PAGENet Li et al. (2020a).
Table 1: Quantitative comparison with the baseline methods (columns: Shape Chamfer Distance, Part Accuracy, Connectivity Accuracy).
We use the Minimum Matching Distance (MMD) Achlioptas et al. (2018) to evaluate the fidelity of the assembled shape. Conditioned on the same input set of parts, we generate multiple shapes sampled from different Gaussian noises, and measure the minimum distance between the ground truth and the assembled shapes. We adopt three distance metrics: part accuracy and shape Chamfer distance, following Li et al. (2020b), and connectivity accuracy, proposed by us. The part accuracy is defined as
$$\mathrm{PA} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\Big( d_c\big(q_i(p_i), q_i^*(p_i)\big) < \tau_p \Big),$$
where $\tau_p$ is a fixed threshold that we pick. Intuitively, it indicates the percentage of parts that match the ground truth parts up to a certain CD threshold. Shape Chamfer distance is calculated the same as Eq. (12).
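Part accuracy can be sketched as a thresholded per-part Chamfer test. The Chamfer variant used below (sum of squared nearest-neighbour distances in both directions) and the threshold passed as `tau` are assumptions for illustration; the threshold value used in the experiments is not recoverable from this section.

```python
def chamfer(xs, ys):
    """Symmetric Chamfer distance with squared distances (assumed variant)."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return (sum(min(sq_dist(x, y) for y in ys) for x in xs)
            + sum(min(sq_dist(x, y) for x in xs) for y in ys))

def part_accuracy(pred_parts, gt_parts, tau):
    """Fraction of posed parts whose Chamfer distance to the ground truth is below tau."""
    hits = sum(1 for p, g in zip(pred_parts, gt_parts) if chamfer(p, g) < tau)
    return hits / len(pred_parts)
```

Here `pred_parts[i]` and `gt_parts[i]` are the same part point cloud transformed by the predicted and the ground-truth pose, respectively.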
Connectivity Accuracy. The part accuracy measures the assembly performance by considering each part separately. In this work, we propose the connectivity accuracy to further evaluate how well the parts are connected in the assembled shape. For each connected part pair $\langle p_i^*, p_j^* \rangle$ in the object space, we firstly select the point in part $p_i^*$ that is closest to part $p_j^*$ as $p_i^*$'s contact point $c_{ij}^*$ with respect to $p_j^*$, then select the point in $p_j^*$ that is closest to $c_{ij}^*$ as the corresponding contact point $c_{ji}^*$ of $p_j^*$. Given the predefined contact point pair $\{c_{ij}^*, c_{ji}^*\}$ located in the object space, we transform each point into its corresponding canonical part space as $\{c_{ij}, c_{ji}\}$. Then we calculate the connectivity accuracy of an assembled shape as
$$\mathrm{CA} = \frac{1}{|\mathcal{C}|}\sum_{\{c_{ij}, c_{ji}\} \in \mathcal{C}} \mathbb{1}\Big( \big\| q_i(c_{ij}) - q_j(c_{ji}) \big\|_2^2 < \tau_c \Big),$$
where $\mathcal{C}$ denotes the set of contact point pairs and $\tau_c$ is a distance threshold. It evaluates the percentage of correctly connected parts.
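The two stages of the metric, picking contact points on the ground-truth shape and checking whether they still meet after applying the predicted poses, can be sketched as follows. Using the squared distance in the threshold test and the name `tau` are assumptions of this sketch.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def contact_pair(part_i, part_j):
    """Point of part_i closest to part_j, then the point of part_j closest to it."""
    c_i = min(part_i, key=lambda a: min(sq_dist(a, b) for b in part_j))
    c_j = min(part_j, key=lambda b: sq_dist(c_i, b))
    return c_i, c_j

def connectivity_accuracy(posed_contacts, tau):
    """posed_contacts: (c_i, c_j) pairs after applying the predicted part poses.

    A pair counts as connected when its squared distance is below tau (assumed form).
    """
    ok = sum(1 for a, b in posed_contacts if sq_dist(a, b) < tau)
    return ok / len(posed_contacts)
```

Since the contact points are stored in each part's canonical space, they move together with the part under any predicted pose, so the test measures whether two parts that touch in the ground truth also touch in the prediction.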
|Method||Shape Chamfer Distance||Part Accuracy (%)||Connectivity Accuracy (%)|
|Our backbone w/o graph learning||0.0086||26.05||28.07|
|Our backbone w. relation reasoning||0.0052||46.85||38.60|
|Our full algorithm||0.0050||49.51||39.96|
We present the quantitative comparisons with the baselines in Table 1. Our algorithm outperforms all these approaches by a significant margin for most columns, especially on the part and connectivity accuracy metrics. According to the visual results in Figure 2 (left), we also observe the best assembly results are achieved by our algorithm, while the baseline methods usually fail in producing well-structured shapes. We also show multiple assembled shapes in Figure 2 (right) while sampling different Gaussian noises as inputs. We see that some bar-shape parts are assembled into different positions to form objects of different structures.
[Figure: time-varying part assembly results, with panels for Step 1 to Step 5 and the ground truth.]
We also ablate the three key components of our method: the iterative GNN, the dynamic part relation reasoning module and the dynamic part aggregation module. The results on the PartNet Table category are shown in Table 3, and we see that our full model achieves the best performance compared to the ablated versions. We firstly justify the effectiveness of our backbone by replacing the graph learning module with a multi-layer perceptron that estimates the part-wise pose from the concatenated separate and overall part features. We further incorporate the dynamic graph module into our backbone for evaluation. We observe that our proposed backbone and dynamic graph both contribute significantly to the final performance.
Figure 6 summarizes the learned relation weights at each time step by averaging over all PartNet chairs. We pick the four common types of parts: back, seat, leg and arm. We clearly see similar statistical patterns among the even iterations and among the odd ones. At even iterations, the relation graph is updated from the dense node set. It focuses more on passing messages from the central parts (i.e. seat, back) to the peripheral parts (i.e. leg, arm) and overlooks the relation between legs. At odd iterations, in contrast, the relation graph is updated from the sparse node set, where geometrically-equivalent parts like legs are aggregated into a single node. In this case, the relation graph shows that all the parts are influenced relatively more by the peripheral parts. On average, the central parts have bigger emitting relation weights than the peripheral parts, indicating that the central parts guide the assembly process more.
We further illustrate the changes of part accuracy and its associated improvement at each time step in Figure 5. We find that the central parts are consistently predicted more accurately than the peripheral parts. Interestingly, the improvement of peripheral parts is relatively higher than central parts at even iterations, demonstrating the fact that central parts guide the pose predictions for the peripheral parts. Figure 8 visualizes the time-varying part assembly results, showing that the poses for the central parts are firstly determined and then the peripheral parts gradually adjust their poses to match the central parts. The results finally converge to stable part assembly predictions.
In this work, we propose a novel dynamic graph learning algorithm for the part assembly problem. We develop an iterative graph neural network backbone that learns to dynamically evolve node and edge features within the part graph, augmented with the dynamic relation reasoning module and the dynamic part aggregation module. Through thorough experiments and analysis, we demonstrate that the proposed method outperforms the three baseline methods by learning an effective assembly-oriented relation graph. Future work may investigate learning better part assembly generative models that consider part joint information and higher-order part relations.
This work was supported by the start-up research funds from Peking University (7100602564) and the Center on Frontiers of Computing Studies (7100602567). We would also like to thank Imperial Institute of Advanced Technology for GPU support.
This document provides additional supplemental material that could not be included in the main paper due to the page limit:
Additional ablation study.
Analysis of dynamic graph on additional parts and object categories.
Failure cases and future work.
Additional results of structural variation.
Additional qualitative results.
In this section, we demonstrate the effectiveness of different components. We test our framework by proposing the following variants, where the results are in Table 3.
Our backbone w/o graph learning: Replacing the graph learning module with a multi-layer perceptron that estimates the part-wise pose from the concatenated separate and overall part features.
Our backbone: Only the iterative graph learning module.
Our backbone + relation reasoning: Incorporating the dynamic relation reasoning module into our backbone.
Our backbone + part aggregation: Incorporating the dynamic part aggregation module into our backbone.
Exchange dense/sparse node set iteration: Switching the node set order in our algorithm. Specifically, we learn over the dense and sparse node set at even and odd steps respectively.
Input GT adjacency relation: Instead of learning the relation weights from the output poses, we replace the dynamically-evolving relation weights with static ground truth adjacency relation between two parts, which is a binary value. Note the ground truth relation only covers the adjacency, not the symmetry and any other type of relations.
Reasoning relation from geometry: We modify the dynamic relation reasoning module by replacing the input pose information in Equation 5 of the paper with part point cloud transformed by the estimated pose.
From Table 3, we observe that the proposed iterative GNN backbone, dynamic relation reasoning module and dynamic part aggregation module all contribute to the assembly quality significantly. Experimentally, we also exchange the order of sparse and dense node set in the iterative graph learning process, and do not observe much difference compared to our full algorithm. In order to justify our learned relation weights, we employ the ground truth binary adjacency relations to replace learned ones, and observe much worse performance than our learned relations. Finally, instead of learning the relation from estimated poses as in Equation 5 of the main paper, we alternatively replace the pose with transformed part point cloud, and also observe degraded performance as analyzed in the paper.
|Method||Shape Chamfer Distance||Part Accuracy (%)||Connectivity Accuracy (%)|
|Our backbone w/o graph learning||0.0086||26.05||28.07|
|Our backbone + relation reasoning||0.0052||46.85||38.60|
|Our backbone + part aggregation||0.0051||48.01||38.13|
|Exchange dense/sparse node set iteration||0.0052||49.19||39.62|
|Input GT adjacency relation||0.0053||45.43||35.66|
|Reasoning relation from geometry||0.0053||45.11||39.21|
|Our full algorithm||0.0050||49.51||39.96|
We demonstrate additional learned relation weights in Figure 6. In the chair category, we expand the four parts in the main paper to eight parts, and we observe that the central parts (back, seat, head) have larger emitted relations and the peripheral parts (regular_leg, arm, footrest, pedestal_base, star_leg) have larger received relations. This reveals the same fact as shown in the main paper: the central parts guide the assembly process. A similar phenomenon can also be observed in the lamp and table categories, for which we show only four parts because few part types are common in the dataset.
Our framework is implemented in PyTorch. The network is trained for around 200 epochs until convergence. The initial learning rate is set to 0.001, and we employ Adam to optimize the whole framework. We apply the supervision to the output poses from all the time steps to accelerate convergence. The graph neural network loops for five iterations to produce the final output pose. Experimentally, we find that the performance approaches saturation at five iterations, and we have not observed obvious improvement with more iterations.
In order to compute geometrically-equivalent parts, we firstly filter out the part pairs whose Axis-Aligned-Bounding-Box dimensions differ by more than a threshold of 0.1, then, among the remaining pairs, treat as geometrically equivalent those whose Chamfer distance is below an empirical threshold of 0.2.
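One possible reading of this grouping procedure is sketched below. It assumes that two parts count as equivalent when every bounding-box dimension differs by at most 0.1 and their Chamfer distance is below 0.2, that parts are compared directly in their canonical space, and that the Chamfer distance sums squared nearest-neighbour distances; all three points are assumptions, as are the function names.

```python
def chamfer(xs, ys):
    """Symmetric Chamfer distance with squared distances (assumed variant)."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return (sum(min(sq_dist(x, y) for y in ys) for x in xs)
            + sum(min(sq_dist(x, y) for x in xs) for y in ys))

def aabb_dims(points):
    """Side lengths of the axis-aligned bounding box of a 3D point cloud."""
    return [max(p[d] for p in points) - min(p[d] for p in points) for d in range(3)]

def is_equivalent(a, b, box_tol=0.1, cd_tol=0.2):
    """Cheap bounding-box test first, then the Chamfer test on the survivors."""
    if any(abs(x - y) > box_tol for x, y in zip(aabb_dims(a), aabb_dims(b))):
        return False
    return chamfer(a, b) < cd_tol

def group_equivalent(parts):
    """Greedily group part indices by comparing against each group's first member."""
    groups = []
    for idx, part in enumerate(parts):
        for g in groups:
            if is_equivalent(parts[g[0]], part):
                g.append(idx)
                break
        else:
            groups.append([idx])
    return groups
```

The bounding-box comparison acts as an inexpensive pre-filter, so the quadratic-cost Chamfer distance is only evaluated for parts of similar size.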
In Figure 7, we show a few cases where our algorithm fails to assemble a well-connected shape and hence generates some floating parts. For example, the legs and arms are disconnected or misaligned from the chair base and back. This indicates that our algorithm design lacks physical constraints to enforce the connection among the parts. Our learned dynamic part graph builds a soft relation between the central and peripheral parts for a progressive part assembly procedure, but lacks hard connection constraints. In future work, we plan to address this problem by developing a joint-centric assembly framework that focuses more on the relative displacement and rotation between the parts, complementing the current part-centric algorithm.
[Figure 7: failure cases, shown as pairs of ground truth and our result.]
Many previous works Gao et al. (2019); Wu et al. (2019); Wang et al. (2018); Mo et al. (2020b) learn to create a novel shape from scratch by embedding both the geometric and structural diversity into the generative networks. However, provided the part geometry, our problem only allows structural diversity to be modeled. It poses a bigger challenge to the generative model since some shapes may be assembled differently, while the others can only be assembled uniquely (i.e., only one deterministic result).
We demonstrate additional diverse shapes assembled by our algorithm in Figure 8. The top five visual examples exhibit various results of structural variation, while in the bottom three examples, our algorithm learns to predict the same assembly results due to the limited input part set.
[Figure 8: additional results of structural variation, with panels for Assembly 1, Assembly 2, Assembly 3 and the ground truth.]
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 605–613.