3DPartAssembly
Autonomous assembly is a crucial capability for robots in many applications. For this task, several problems such as obstacle avoidance, motion planning, and actuator control have been extensively studied in robotics. However, when it comes to task specification, the space of possibilities remains underexplored. Towards this end, we introduce a novel problem, single-image-guided 3D part assembly, along with a learning-based solution. We study this problem in the setting of furniture assembly from a given complete set of parts and a single image depicting the entire assembled object. Multiple challenges exist in this setting, including handling ambiguity among parts (e.g., slats in a chair back and leg stretchers) and 3D pose prediction for parts and part subassemblies, whether visible or occluded. We address these issues by proposing a two-module pipeline that leverages strong 2D-3D correspondences and assembly-oriented graph message-passing to infer part relationships. In experiments with a PartNet-based synthetic benchmark, we demonstrate the effectiveness of our framework as compared with three baseline approaches.
The important and seemingly straightforward task of furniture assembly presents serious difficulties for autonomous robots. A general robotic assembly task consists of action sequences incorporating the following stages: (1) picking up a particular part, (2) moving it to a desired 6D pose, (3) mating it precisely with the other parts, and (4) returning the manipulator to a pose appropriate for the next pick-up movement. Solving such a complicated high-dimensional motion planning problem [25, 21] requires considerable time and engineering effort. Current robotic assembly solutions first determine the desired 6D pose of parts [9] and then hard-code the motion trajectories for each specific object [54]. Such limited generalizability and painstaking process planning fail to meet demands for fast and flexible industrial manufacturing and household assembly tasks [31].
To generate smooth and collision-free motion planning and control solutions, the 6D poses of parts in 3D space must be predicted accurately [54, 27]. We propose a 3D part assembly task whose output can reduce the complexity of the high-dimensional motion planning problem. We aim to learn generalizable skills that allow robots to autonomously assemble unseen objects from parts [16]. Instead of handcrafting a fixed set of rules to assemble one specific chair, for example, we explore category-wise structural priors that help robots assemble all kinds of chairs. The shared part relationships across instances in a category not only suggest potential pose estimation solutions for unseen objects but also lead to possible generalization ability for robotic control policies [64, 53, 42, 60].
We introduce the task of single-image-guided 3D part assembly: inducing 6D poses of the parts in 3D space [30] from a set of 3D parts and an image depicting the complete object. Robots can acquire geometry information for each part using 3D sensing, but the only information provided for the entire object shape is the instruction image. Different from many structure-aware shape modeling works [40, 71, 17, 62, 70, 32, 52], we do not assume any specific granularity or semantics of the input parts, since the given furniture parts may not belong to any known part semantics and some of the parts may be provided pre-assembled into bigger units. We also step away from instruction manuals illustrating the step-by-step assembly process, as teaching machines to read sequential instructions conveyed in natural language and figures is still a hard problem.
At the core of the task lie several challenges. First, some parts may have similar geometry. For example, distinguishing the geometric subtlety of chair leg bars, stretcher bars, and back bars is a difficult problem. Second, 3D geometric reasoning is essential in finding a joint global solution, where every piece fits perfectly in the puzzle. Parts follow a more rigid relationship graph which determines a unique final solution that emerges from the interactions between the geometries of the parts. Third, the image grounds and selects one single solution from all possible part combinations that might all be valid for the generative task. Thus, the problem is at heart a reconstruction task where the final assembly needs to agree with the input image. Additionally, and different from object localization tasks, the 3D part assembly task must locate all input parts, not only posing the parts visible in the image, but also hallucinating poses for the invisible ones by leveraging learned data priors. One could provide multiple images to expose all parts to the robot, but this reduces the generalizability to real-world scenarios and might not be easy to achieve. Thus, we focus on solving the task of single-image and category-prior-guided pose prediction.
In this paper, we introduce a learning-based method to tackle the proposed single-image-guided 3D part assembly problem. Given the input image and a set of 3D parts, we first focus on 2D structural guidance by predicting a part-instance image segmentation to serve as a 2D-3D grounding for the downstream pose prediction. To enforce reasoning involving fine geometric subtleties, we have designed a context-aware 3D geometric feature to help the network reason about each part pose, conditioned on the existence of other parts, which might be of similar geometry. Building on the 2D structural guidance, we can generate a pose proposal for each visible part and leverage these predictions to help hallucinate poses for invisible parts as well. Specifically, we use a part graph network, whose edges encode different relationships among parts, and design a two-phase message-passing mechanism to take part relationship constraints into consideration in the assembly.
To the best of our knowledge, we are the first to assemble unlabeled 3D parts with a single image input. We set up a testbed of the problem on the recently released PartNet [41] dataset. We pick three furniture categories with large shape variations that require part assembly: Chair, Table and Cabinet. We compare our method with several baseline methods to demonstrate the effectiveness of our approach. We follow the official PartNet train-test splits and evaluate all models on the unseen test shapes. Extensive ablation experiments also demonstrate the effectiveness and necessity of the proposed modules: the 2D-mask-grounding component and the 3D-message-passing reasoning component.
In summary, our contributions are:
we formulate the task of single-image-guided 3D part assembly;
we propose a two-module method, consisting of a part-instance image segmentation network and an assembly-aware part graph convolution network;
we compare with three baseline methods and conduct ablation studies demonstrating the effectiveness of our proposed method.
We review previous works on 3D pose estimation, single-image 3D reconstruction, and part-based shape modeling, and discuss how they relate to our task.
3D Pose Estimation. Estimating the pose of objects or object parts is a long-standing problem with a rich literature. As early as 2001, Langley et al. [75] attempted to utilize visual sensors and neural networks to predict the pose for robotic assembly tasks. Andy et al. [77] built a robotic system taking multi-view RGB-D images as the input and predicting the 6D pose of objects for the Amazon Picking Challenge. Recently, Litvak et al. [37] proposed a two-stage pose estimation procedure taking depth images as input. In the vision community, there is also a line of works studying instance-level object pose estimation for known instances [1, 48, 59, 28, 72, 58, 2] and category-level pose estimation [19, 44, 3, 63, 7] that can possibly deal with unseen objects from known categories. There are also works on object re-localization from scenes [76, 23, 61]. Different from these works, our task takes as inputs unseen parts without any semantic labels at test time, and requires certain part relationships and constraints to hold in order to assemble a plausible and physically stable 3D shape.
Single-Image 3D Reconstruction. Previous works reconstruct 3D shape from a single image with the representations of voxel grids [10, 57, 67, 49], point clouds [15, 34, 22], meshes [65, 69], parametric surfaces [18], and implicit functions [8, 39, 45, 51, 74]. While one can consider employing such 2D-to-3D lifting techniques as a prior step in our assembly process so that the given parts can be matched to the predicted 3D shape, it can misguide the assembly in multiple ways. For instance, the 3D prediction can be inaccurate, and even small geometric differences can be crucial for part pose prediction. Also, the occluded area can be hallucinated in different ways. In our case, the set of parts that should compose the object is given, and thus the poses of occluded parts can be more precisely specified. Given these, we do not leverage 3D shape generation techniques and directly predict the part poses from the input 2D image.
Part-Based Shape Modeling. 3D shapes have compositional part structures. Chaudhuri et al. [5], Kalogerakis et al. [26] and Jaiswal et al. [24] introduced frameworks learning probabilistic graphical models that describe pairwise relationships of parts. Chaudhuri and Koltun [6], Sung et al. [55] and Sung et al. [56] predict the compatibility between a part and a partial object for sequential shape synthesis by parts. Dubrovina et al. [13], PAGENet [32] and CompoNet [52] take the set of parts as the input and generate the shape of the assembled parts. Different from these works that usually assume known part semantics or a part database, our task takes a set of unseen parts at test time and we do not assume any provided part semantic labels.
GRASS [33], Im2Struct [43] and StructureNet [40] learn to generate box-abstracted hierarchical shape structures. SAGNet [71] and SDMNet [17] learn the pairwise relationships among parts that are subsequently integrated into a latent representation of the global shape. G2LGAN [62] autoencodes the shape of an entire object with per-point part labels, and a subsequent network in the decoding refines the geometry of each part. PQNet [70] represents a shape as a sequence of parts and generates each part at every step of the iterative decoding process. All of these works are relevant but different from ours in that we obtain the final geometry of the object not by directly decoding the latent code into part geometry but by predicting the poses of the given parts and explicitly assembling them. There are also works studying partial-to-full shape matching [35, 36, 12]. Unlike these works, we use a single image as the guidance, instead of a 3D model.
We define the task of single-image-guided 3D part assembly: given a single RGB image of size depicting a 3D object and a set of 3D part point clouds (), we predict a set of part poses in space. After applying the predicted rigid transformations to all the input parts ’s, their union reconstructs the 3D object . We predict output part poses in the camera space, following previous works [14, 66]. In our paper, we use quaternions to represent rotations and use and interchangeably.
We conduct a series of pose and scale normalizations on the input part point clouds to ensure synthetic-to-real generalizability. We normalize each part point cloud to have a zero-mean center and use a local part coordinate system computed using PCA [46]. To normalize the global scale of all training and testing data, we compute axis-aligned bounding boxes (AABBs) for all the parts and normalize them so that the longest box diagonal across all ’s of a shape has unit length while preserving their relative scales. We cluster the normalized part point clouds ’s into sets of geometrically equivalent part classes , where , , etc. For example, four legs of a chair are clustered together if their geometry is identical. This process of grouping indiscernible parts is essential to resolve the ambiguity among them in our framework. is a disjoint complete set such that for every and . We denote the representative point cloud for each class .
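The centering and scale normalization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, points are plain lists of 3D coordinates, and the PCA alignment step is omitted, so the bounding boxes are computed along the input axes.

```python
import math

def center_part(points):
    """Translate a part so that its point-cloud mean sits at the origin."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    return [[p[k] - mean[k] for k in range(3)] for p in points]

def aabb_diagonal(points):
    """Length of the diagonal of the axis-aligned bounding box of a part."""
    extents = [max(p[k] for p in points) - min(p[k] for p in points)
               for k in range(3)]
    return math.sqrt(sum(e * e for e in extents))

def normalize_parts(parts):
    """Center every part, then divide all parts by the longest AABB diagonal.

    A single shared scale factor keeps the relative sizes of the parts.
    """
    centered = [center_part(part) for part in parts]
    longest = max(aabb_diagonal(part) for part in centered)
    return [[[x / longest for x in p] for p in part] for part in centered]
```

After this step the largest part has a unit bounding-box diagonal, and every other part keeps its size relative to it.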
We propose a method for the task of single-image-guided 3D part assembly, which is composed of two network modules: the part-instance image segmentation module and the part pose prediction module; see Figure 2 for the overall architecture. We first extract a geometry feature for each part from the input point cloud and generate instance-level 2D segmentation masks on the input image (). Conditioned on the predicted segmentation masks, our model then leverages both the 2D mask features and the 3D geometry features to propose 6D part poses . We explain these two network modules in the following subsections. See the supplementary material for the implementation details.
To induce a faithful reconstruction of the object represented in the image, we need to learn a structural layout of the input parts from the 2D input. We predict a part instance mask for each part . All part masks are subject to the disjoint constraint, i.e., , where denotes a background mask. If a part is invisible, we simply predict an empty mask and let the second network hallucinate a pose by leveraging contextual information and learned data priors. The task difficulty is twofold. First, the network needs to distinguish between the geometric subtleties of the input part point clouds to establish a valid 2D-3D correspondence. Second, for the identical parts within each geometrically equivalent class, we need to identify separate 2D mask regions to pinpoint their exact locations. Below, we explain how our proposed method is designed to tackle the above challenges.
Context-Aware 3D Part Features. To enable the network to reason about the delicate differences between parts, we construct the context-aware 3D conditional feature (), which is computed from three components: part geometry feature , instance one-hot vector (), and a global part contextual feature . We use PointNet [47] to extract a global geometry feature for each part point cloud . If a part has multiple instances within a geometrically equivalent class (e.g. four chair legs), we introduce an additional instance one-hot vector to tell them apart. For a part that has only one instance, we use a one-hot vector whose first element is 1. For contextual awareness, we extract a global feature over all the input part point clouds, to help the network distinguish between similar but not equivalent part geometries (e.g. a short bar or a long bar). Precisely, we first compute and for every part, then compute to obtain the per-part local feature, where SLP is short for Single-Layer Perceptron. We aggregate over all part local features via a max-pooling symmetric function to compute the global contextual feature . Finally, we define to be the context-aware 3D per-part feature.
Conditional U-Net Segmentation. We use a conditional U-Net [50] for the part-instance segmentation task. Preserving the standard U-Net CNN architecture, our encoder takes a 3-channel RGB image as input and produces a bottleneck feature map (). Concatenating the image feature with our context-aware 3D part conditional feature , we obtain , where we duplicate along the spatial dimensions for times. The decoder takes the conditional bottleneck feature and decodes a part mask for every input part . We keep the skip links between encoder and decoder layers as introduced in the original U-Net paper. To satisfy the non-overlapping constraint, we add a SoftMax layer across all predicted masks, augmented with a background mask .
With the 2D grounding masks produced by the part-instance image segmentation module, we predict a 6D part pose for every input part using the part pose prediction module. We predict a unit quaternion vector that corresponds to a 3D rotation and a translation vector denoting the part center position in the camera space.
Different from object pose estimation, the task of part assembly requires a joint prediction of all part poses. Part pose predictions should not be independent of each other, as part poses follow a set of more rigid relationships, such as symmetry and parallelism. For a valid assembly, parts must be in contact with adjacent parts. The rich part relationships restrict the solution space for each part pose. We leverage a two-phase graph convolutional neural network to address the joint communication of part poses for the task of part assembly.
Mask-Conditioned Part Features. We consider three sources of features for each part: 2D image feature , 2D mask feature (), and context-aware 3D part feature . We use a ResNet-18 [20] pre-trained on ImageNet [11] to extract the 2D image feature . We use a separate ResNet-18 that takes the 1-channel binary mask as input and extracts a 2D mask feature , where masks for invisible parts are predicted as empty. Finally, we propagate the 3D context-aware part feature introduced in Sec. 4.1 that encodes 3D part geometry information along with its global context.
Two-Phase Graph Convolution. We create a part graph , treating every part as a node, and propose two phases of graph convolution to predict the pose of each part. We first describe how we construct the edges in each phase, and then introduce our assembly-oriented graph convolution operations.
During the first phase, we draw pairwise edges among all parts in every geometrically equivalent part class and perform graph convolution over , where
(1) 
Edges in allow message passing among geometrically identical parts that are likely to have certain spatial relationships or constraints (e.g. four legs of a chair have two orthogonal reflection planes). After the first-phase graph convolution, each node has an updated node feature. The updated node feature is then decoded as a 6D pose for each part. The predicted part poses produce an initial assembled shape.
We leverage a second phase of graph convolution to refine the predicted part poses. Besides the edges in , we draw a new set of edges by finding the top-5 nearest neighbors for each part based upon the initial assembly and define . The intuition here is that once we have an initial part assembly, we are able to connect the adjacent parts so that they learn to attach to each other with certain joint constraints.
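The edge construction for the two phases can be sketched as below. This is an illustrative sketch under stated assumptions: the text does not say which distance is used to rank neighbors, so the distance between predicted part centers is assumed here, and the function names are hypothetical.

```python
import math
from itertools import permutations

def class_edges(classes):
    """First-phase edges: all ordered pairs within each equivalence class.

    `classes` is a list of lists of part indices.
    """
    return {(i, j) for members in classes for i, j in permutations(members, 2)}

def knn_edges(centers, k=5):
    """Edges from every part to its k nearest parts.

    Assumption: neighbors are ranked by the distance between the part
    centers predicted in the initial assembly.
    """
    edges = set()
    for i, ci in enumerate(centers):
        ranked = sorted((math.dist(ci, cj), j)
                        for j, cj in enumerate(centers) if j != i)
        edges.update((i, j) for _, j in ranked[:k])
    return edges

def second_phase_edges(classes, centers, k=5):
    """Second-phase edges: first-phase edges plus nearest-neighbor edges."""
    return class_edges(classes) | knn_edges(centers, k)
```

With fewer than k + 1 parts, every part is simply connected to all others.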
We implement the graph convolution as two iterations of message passing [73, 68, 40]. Given a part graph with initial node features and edge features , each iteration of message passing starts by computing edge features
(2) 
where we do not use during the first phase of graph convolution, and define if and if for the second phase. Then, we perform average-pooling over all edge features that are connected to a node and obtain the updated node feature
(3) 
We define if there is no edge drawn from node . We define the final node features to be for each phase of graph convolution.
We denote the final node features of the first-phase and second-phase graph convolution to be and , respectively, for a part .
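One message-passing iteration of the kind described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: `edge_fn` is a hypothetical stand-in for the learned edge network, features are plain lists, and a node without outgoing edges is assumed to receive a zero vector.

```python
def message_passing_step(node_feats, edges, edge_fn):
    """Compute edge features, then average them per source node.

    node_feats: list of feature vectors, one per part.
    edges: iterable of directed pairs (i, j), drawn from node i to node j.
    edge_fn: maps (feature_i, feature_j) to an edge feature with the same
             length as a node feature (stand-in for a learned network).
    """
    dim = len(node_feats[0])
    sums = [[0.0] * dim for _ in node_feats]
    counts = [0] * len(node_feats)
    for i, j in edges:
        edge_feat = edge_fn(node_feats[i], node_feats[j])
        sums[i] = [s + e for s, e in zip(sums[i], edge_feat)]
        counts[i] += 1
    # Average-pool the edge features; assumed zero vector for isolated nodes.
    return [[s / c for s in row] if c else [0.0] * dim
            for row, c in zip(sums, counts)]
```

Running the function twice with the same edge set corresponds to the two iterations used in each phase.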
Part Pose Decoding. After the two-phase graph convolution yields the node features and , we use a Multi-Layer Perceptron (MLP) to decode part poses at each phase.
(4) 
To ensure that the predicted quaternion has unit length, we normalize the output vector so that .
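The normalization and the way a decoded pose acts on a part can be sketched as follows. This is a minimal sketch with hypothetical function names; the (w, x, y, z) component order is an assumption, since the text does not state the convention.

```python
import math

def normalize_quaternion(q):
    """Rescale a raw 4-vector to unit length (assumes q is non-zero)."""
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]

def rotate(q, p):
    """Rotate point p by the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    rot = [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]
    return [sum(rot[r][c] * p[c] for c in range(3)) for r in range(3)]

def apply_pose(q, t, points):
    """Rotate a normalized part point cloud, then translate it by t."""
    q = normalize_quaternion(q)
    return [[a + b for a, b in zip(rotate(q, p), t)] for p in points]
```

For example, the raw output (2, 0, 0, 2) normalizes to a 90-degree rotation about the z-axis.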
We first train the part-instance image segmentation module until convergence and then train the part pose prediction module. Empirically, we find that having a good mask prediction is necessary before training for the part poses.
Loss for Part-Instance Image Segmentation. We adapt the negative soft-IoU loss from [38] to supervise the training of the part-instance image segmentation module. We perform Hungarian matching [29] within each geometrically equivalent class to guarantee that the loss is invariant to the order of the parts in the ground truth and the prediction. The loss is defined as
(5) 
where and denote the ground truth and the matched predicted mask. refers to the matching results that match ground-truth part indices to the predicted ones. includes all 2D index ’s on an image plane.
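An order-invariant negative soft-IoU loss for one geometrically equivalent class can be sketched as below. This is a sketch under assumptions: it uses a common soft-IoU formulation that may differ from the exact loss in [38], masks are flattened lists of values in [0, 1], and a brute-force search over permutations stands in for Hungarian matching, which is only practical for small classes.

```python
from itertools import permutations

def soft_iou(pred, gt):
    """Soft intersection-over-union of two flattened masks."""
    inter = sum(p * g for p, g in zip(pred, gt))
    union = sum(p + g - p * g for p, g in zip(pred, gt))
    return inter / union if union > 0 else 0.0

def matched_soft_iou_loss(pred_masks, gt_masks):
    """Return (loss, matching) for one class of interchangeable parts.

    matching[i] is the index of the predicted mask assigned to ground-truth
    mask i. Brute force replaces Hungarian matching in this sketch.
    """
    n = len(gt_masks)

    def total_iou(perm):
        return sum(soft_iou(pred_masks[perm[i]], gt_masks[i]) for i in range(n))

    best = max(permutations(range(n)), key=total_iou)
    return -total_iou(best), list(best)
```

Because the best assignment is chosen before the loss is computed, swapping the order of two identical legs in the prediction does not change the loss.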
Losses for Part Pose Prediction. For the pose prediction module, we design an order-invariant loss by conducting Hungarian matching within each geometrically equivalent class . Additionally, we observe that separating the supervision losses for translation and rotation helps stabilize training. We use the following training loss for the pose prediction module.
(6) 
We use the Euclidean distance to measure the difference between the 3D translation prediction and ground truth translation for each part. We denote as the matching results.
(7) 
where and denote the matched predicted translation and the ground truth 3D translation. We use weight parameter of in training.
We use two losses for rotation prediction: Chamfer distance [14] and L2 distance . Because many parts have symmetric geometry (e.g. bars and boards), which results in multiple rotation solutions, we use Chamfer distance as the primary supervising loss to address such pose ambiguity. Given the point cloud of part , the ground truth rotation , and the matched predicted rotation , the Chamfer distance loss is defined as
(8) 
where and denote the rotated part point clouds using and respectively. We use for the Chamfer loss. Some parts may not be perfectly symmetric (e.g. a bar whose two ends have small but noticeable geometric differences). In this case, using Chamfer distance by itself would make the network fall into local minima. We encourage the network to correct this situation by penalizing the Euclidean distance between the matched predicted rotated point cloud and the ground truth rotated point cloud.
(9) 
where denotes the Frobenius norm and is the number of points per part. Note that on its own is not sufficient in cases when the parts are completely symmetric. Thus, we add the L2 loss as a regularizing term with a smaller weight of . We conducted an ablation experiment demonstrating that the L2 loss contributes to correcting the rotation for some parts.
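The complementary behavior of the two rotation losses can be illustrated with a small example. This is a sketch under assumptions: it uses one common squared-distance form of the Chamfer distance and a mean squared point-to-point distance for the L2 term, which may differ in normalization from the paper's equations.

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance using mean squared nearest-neighbor distances."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def l2_distance(a, b):
    """Mean squared distance between corresponding points (order matters)."""
    return sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a)

# A symmetric bar and the same bar rotated by 180 degrees about the z-axis.
bar = [[-1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
flipped = [[-x, -y, z] for x, y, z in bar]

symmetric_chamfer = chamfer_distance(bar, flipped)  # ignores the flip
symmetric_l2 = l2_distance(bar, flipped)            # penalizes the flip
```

The Chamfer term treats both orientations of the symmetric bar as correct, while the point-to-point term does not, which is consistent with using it only as a lightly weighted regularizer.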
Finally, we compute a holistic shape Chamfer distance, as the predicted assembly should be close to the ground truth in terms of Chamfer distance.
(10) 
where denotes the predicted assembled shape point cloud and denotes the ground truth shape point cloud. This loss encourages the holistic shape appearance and the part relationships to be close to the groundtruth. We use .
In this section, we set up the testbed for the proposed single-image-guided 3D part assembly problem on the PartNet [41] dataset. To validate the proposed approach, we compare against three baseline methods. Both qualitative and quantitative results demonstrate the effectiveness of our method.
Recently, Mo et al. [41] proposed the PartNet dataset, which is the largest 3D object dataset with fine-grained and hierarchical part annotation. Every PartNet object is provided with a ground-truth hierarchical part instance-level semantic segmentation, from coarse to fine-grained levels, which provides a good complexity of parts. In our work, we use the three largest furniture categories that require real-world assembly: Chair, Table and Cabinet. We follow the official PartNet train/validation/test split (roughly ) and filter out the shapes with more than 20 parts.
For each object category, we create two data modalities: Level-3 and Level-mixed. Level-3 corresponds to the most fine-grained PartNet segmentation. While we do not assume known part semantics, an algorithm can implicitly learn the semantic priors when dealing with the Level-3 data, which is undesirable for our goal of generalizing to real-life assembly settings, as it is unrealistic to assume that IKEA furniture also follows the same semantics as PartNet. To force the network to reason with part geometries, we created an additional modality, Level-mixed, which contains part segmentation at all levels in the PartNet hierarchy. Specifically, for each shape, we traverse every path of the ground-truth part hierarchy and stop at any level randomly. We have 3736 chairs, 2431 tables, 704 cabinets in Level-3 and 4664 chairs, 5987 tables, 888 cabinets in Level-mixed.
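The random traversal used to build Level-mixed data can be sketched as a random cut through a part hierarchy. This is an illustrative sketch only: the dictionary layout, the toy `chair` hierarchy, and the stopping probability `stop_prob` are assumptions, not details taken from PartNet or from the paper.

```python
import random

def random_cut(node, rng, stop_prob=0.5):
    """Walk down every path of the hierarchy and stop at a random level.

    Returns the names of the nodes where the traversal stopped; together
    they cover every leaf part exactly once.
    """
    children = node.get("children", [])
    if not children or rng.random() < stop_prob:
        return [node["name"]]
    parts = []
    for child in children:
        parts.extend(random_cut(child, rng, stop_prob))
    return parts

# Hypothetical toy hierarchy for illustration.
chair = {
    "name": "chair",
    "children": [
        {"name": "back", "children": [{"name": "slat_1"}, {"name": "slat_2"}]},
        {"name": "seat"},
        {"name": "base", "children": [{"name": "leg_1"}, {"name": "leg_2"}]},
    ],
}

parts = random_cut(chair, random.Random(0))
```

Depending on the random draws, the back may appear as one pre-assembled unit or as two separate slats, so part granularity varies from shape to shape.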
For the input image, we render a set of images of the PartNet models with ShapeNet textures [4]. We then compute the world-to-camera matrix accordingly and obtain the ground-truth 3D object position in the camera space, which is used for supervising part-instance segmentation. For the input point cloud, we use Furthest Point Sampling (FPS) to sample points over each part mesh. We then normalize them following the descriptions in Sec. 3. After the parts are normalized, we detect geometrically equivalent classes of parts by first filtering out parts by comparing the dimensions of their AABBs under a threshold of 0.1. We further process the remaining parts by computing all possible pairwise part Chamfer distances, normalized by their average diagonal length, under a hand-picked threshold of 0.02.
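Furthest Point Sampling itself is a standard greedy procedure and can be sketched as follows. The sketch assumes a pool of candidate points already sampled from the part mesh and a requested count `k` no larger than the pool; the function name and the `start` parameter are hypothetical.

```python
import math

def furthest_point_sampling(points, k, start=0):
    """Greedily pick k indices so that each new point is as far as possible
    from the points already selected."""
    selected = [start]
    # Distance from every candidate to its nearest selected point so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return selected
```

Compared with uniform random sampling, this spreads the samples evenly over the part, which helps thin structures such as bars keep their extent.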
To evaluate the part assembly performance, we use two metrics: part accuracy and shape Chamfer distance. The object pose estimation community usually uses metrics such as 5-degree, 5-cm. However, fine-grained part segments usually show abundant pose ambiguity. For example, a chair leg may be simply a cylinder, which has full rotational and reflective symmetry. Thus, we introduce the part accuracy metric, which leverages the Chamfer distance between the part point clouds after applying the predicted part pose and the ground truth pose to address such ambiguity. Following the notation defined in Section 4.3, we define the Part Accuracy Score (PA) as follows and set a threshold of .
(11) 
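A part accuracy score of this kind can be sketched as the fraction of parts whose posed point cloud lies within a Chamfer-distance threshold of the ground truth. This is a hedged sketch: the threshold `tau` is passed in as a parameter, the Chamfer variant uses mean squared distances as an assumption, and parts are assumed to be already matched to their ground-truth counterparts.

```python
import math

def chamfer_distance(a, b):
    """Symmetric Chamfer distance using mean squared nearest-neighbor distances."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def part_accuracy(pred_parts, gt_parts, tau):
    """Fraction of parts whose predicted pose is within tau of the ground truth.

    pred_parts[i] and gt_parts[i] are the same part's points after applying
    the predicted and the ground-truth pose, respectively.
    """
    hits = sum(1 for pred, gt in zip(pred_parts, gt_parts)
               if chamfer_distance(pred, gt) < tau)
    return hits / len(gt_parts)
```

Because the score compares posed point clouds rather than pose parameters, any rotation that maps a symmetric part onto itself counts as correct.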
Borrowing the evaluation metric heavily used in the 3D object reconstruction community, we also measure the shape Chamfer distance from the predicted assembled shape to the ground-truth assembly. Formally, we define the shape Chamfer distance metric using the notation defined in Section 4.3 as follows.
(12) 
We compare our approach to three baseline methods. Since no previous work addresses exactly the same task, we adapt previous works on part-based shape generative modeling [70, 55, 40, 43] to our setting and compare with them. Most of these works require known part semantics and thus perform part-aware shape generation without the input part conditions. However, in our task, there is no assumption of part semantics or part priors, and thus all methods must explicitly take the input part point clouds as input conditions. We train all three baselines with the same pose loss used in our method, defined in Section 4.3.
Sequential Pose Proposal (BGRU) The first baseline is a sequential model, similar to the methods proposed in [70, 55]. Instead of sequentially generating parts, we sequentially decode candidate poses for a given part geometry, conditioned on an image. For each input part, if there are geometrically equivalent parts , where , we take the first n pose proposals generated using the GRU, and conduct Hungarian matching to match them with the ground truth part poses.
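Modality | Method | Part Accuracy (Chair) | Part Accuracy (Table) | Part Accuracy (Cabinet) | Assembly CD (Chair) | Assembly CD (Table) | Assembly CD (Cabinet)
Level-3 | BGRU | 0.310 | 0.574 | 0.334 | 0.107 | 0.057 | 0.062
Level-3 | BInsOnehot | 0.173 | 0.507 | 0.295 | 0.130 | 0.064 | 0.065
Level-3 | BGlobal | 0.170 | 0.530 | 0.339 | 0.125 | 0.061 | 0.065
Level-3 | Ours | 0.454 | 0.716 | 0.402 | 0.067 | 0.037 | 0.050
Mixed | BGRU | 0.326 | 0.567 | 0.283 | 0.101 | 0.070 | 0.066
Mixed | BInsOnehot | 0.286 | 0.572 | 0.320 | 0.108 | 0.067 | 0.061
Mixed | BGlobal | 0.337 | 0.619 | 0.290 | 0.093 | 0.062 | 0.0677
Mixed | Ours | 0.491 | 0.778 | 0.483 | 0.065 | 0.037 | 0.043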
Instance One-hot Pose Proposal (BInsOneHot) The second baseline uses an MLP to directly infer the pose of a given part from its geometry and the input image, similar to previous works [40, 43] that output box abstractions for shapes. Here, instead of predicting a box for each part, we predict a 6D part pose . We use instance one-hot features to differentiate between the equivalent part point clouds, and conduct Hungarian matching to match with the ground truth part poses regardless of the one-hot encoding.
Global Feature Model (BGlobal) The third baseline improves upon the second baseline by adding our context-aware 3D part feature defined in Section 4.1. Each part pose proposal considers not only the part-specific 3D feature and the 2D image feature, but also a global 3D feature obtained by max-pooling over all 3D part features, which contains information about all parts. This baseline shares similar ideas with PAGENet [32] and CompoNet [52], which also compute global features to assemble each of the generated parts.
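Modality | Method | Part Accuracy, Visible (Chair) | Part Accuracy, Visible (Table) | Part Accuracy, Visible (Cabinet) | Part Accuracy, Invisible (Chair) | Part Accuracy, Invisible (Table) | Part Accuracy, Invisible (Cabinet)
Level-3 | BGRU | 0.3182 | 0.598 | 0.353 | 0.206 | 0.481 | 0.304
Level-3 | BInsOnehot | 0.178 | 0.572 | 0.291 | 0.104 | 0.369 | 0.289
Level-3 | BGlobal | 0.174 | 0.563 | 0.354 | 0.120 | 0.427 | 0.269
Level-3 | Ours | 0.471 | 0.753 | 0.455 | 0.270 | 0.557 | 0.358
Mixed | BGRU | 0.335 | 0.593 | 0.302 | 0.180 | 0.267 | 0.258
Mixed | BInsOnehot | 0.295 | 0.592 | 0.346 | 0.133 | 0.275 | 0.279
Mixed | BGlobal | 0.334 | 0.638 | 0.320 | 0.184 | 0.349 | 0.227
Mixed | Ours | 0.505 | 0.803 | 0.537 | 0.262 | 0.515 | 0.360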
We compare with the three baselines and observe that our method outperforms the baseline methods both qualitatively and quantitatively using the two evaluation metrics, PA and SC. We show significant improvement for occluded part pose hallucination, as Table 2 demonstrates. Qualitatively, we observe that our method can learn to infer part poses for invisible parts by (1) learning a category prior and (2) leveraging visible parts of the same geometrically equivalent class. Our network can reason about the stacked placement structure of cabinets, as shown in the last row of Fig. 3. The input image does not reveal the inner structure of the cabinet, and our proposed approach learns to vertically distribute the geometrically equivalent boards to fit inside the cabinet walls, similar to the ground truth shape instance. The top row of Fig. 3 demonstrates how our network learns to place the occluded back bar along the visible ones. This could be attributed to our first stage of graph convolution, where we leverage visible parts to infer the pose for occluded parts in the same geometrically equivalent class.
Our method demonstrates the most faithful part pose prediction for the shape instance depicted by the input image. As shown in Fig. 3 row (e), our method equally spaces the board parts vertically, which is consistent with the shape structure revealed by the input image. This likely results from our part-instance image segmentation module, where we explicitly predict a 2D-3D grounding, whereas the baseline methods lack such a component. We further demonstrate its effectiveness with an ablation experiment.
However, our proposed method has its limitations in dealing with unusual image views, exotic shape instances, and shapes composed of only one type of part geometry, which result in noisy mask predictions. The 2D-3D grounding error cascades to later network modules, resulting in poor pose predictions. As shown in Fig. 4 row (a), the image view is not very informative of the shape structure, making it difficult to leverage 3D geometric cues to find the 2D-3D grounding. Additionally, this chair instance itself is atypical of the Chair category. We avoided employing differentiable rendering because it does not help address such failure cases. Fig. 4 row (b) reflects a case where a shape instance is composed of a single modality of part geometry. The geometric affinity of the board parts makes it difficult for the network to come to a determinate answer for the segmentation prediction, resulting in a suboptimal part pose prediction. These obstacles arise from the task itself, and all baselines suffer from the same difficulties.
Ablation Experiments We conduct several ablation experiments on our proposed method and losses, trained on PartNet Chair Level-3. Table 3 in the Appendix demonstrates the effectiveness of each ablated component. The part-instance image segmentation module plays the most important role in our pipeline. Removing it results in the most significant performance decrease.
We formulated a novel problem of single-image-guided 3D part assembly and proposed a neural-net-based pipeline for the task that leverages information from both 2D grounding and 3D geometric reasoning. We established a test bed on the PartNet dataset. Quantitative evaluation demonstrates that the proposed method achieves a significant improvement over three baseline methods. In future work, one can study how to leverage multiple images or 3D partial scans as inputs to achieve better results. We also do not explicitly consider the connecting junctions between parts (e.g., pegs and holes) in our framework, which are strong constraints for real-world robotic assembly.
We thank the Vannevar Bush Faculty Fellowship and the grants from the Samsung GRO program and the SAIL Toyota Research Center for supporting the authors' research. This article reflects only the opinions and conclusions of its authors. We also thank Autodesk and Adobe for the research gifts provided.
This document provides supplementary materials accompanying the main paper, including:
Ablation experiments;
Discussion of failure cases and future work;
More architecture details;
More qualitative examples.






Table 3. Ablation experiments on PartNet Chair Level-3.
w/o L2 Rotation loss  0.426  0.445  0.207  0.070
w/o Segmentation  0.363  0.378  0.164  0.084
w/o Graph Conv 1, 2  0.403  0.423  0.178  0.073
w/o Graph Conv 2  0.434  0.456  0.239  0.073
w/o Image Feature  0.403  0.419  0.208  0.077
w/o Global Feature  0.418  0.437  0.202  0.072
Ours (Full)  0.454  0.470  0.270  0.067
Disconnected Parts We notice that our predictions on very fine-grained instances sometimes result in unconnected parts. The assembly setting imposes the physical constraint that each part must be in contact with another part. However, the implicit soft constraint enforced by the second-stage graph convolution is not sufficient for this task. Ideally, the translation and rotation predicted for each part are valid only if they transform the part so that it is in contact with the relevant parts at their joints. For example, in Figure 5 we can see that the chair base bars do not connect at the back. We plan to address this problem in future work by explicitly enforcing contact between parts within a contact neighborhood.
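To make the contact requirement concrete, the following is a minimal sketch of how disconnected parts could be detected after assembly. It is not part of the proposed pipeline: the point-sampled part representation, the function names `min_distance` and `disconnected_parts`, and the tolerance `tol` are all assumptions made for illustration.

```python
import math
from itertools import product


def min_distance(part_a, part_b):
    """Smallest Euclidean distance between any point of part_a and any point of part_b."""
    return min(math.dist(p, q) for p, q in product(part_a, part_b))


def disconnected_parts(parts, tol=0.01):
    """Return indices of parts that are not within `tol` of any other part.

    `parts` is a list of point lists, each already transformed by its
    predicted pose. `tol` is a hypothetical contact tolerance.
    """
    loose = []
    for i, part in enumerate(parts):
        gaps = [min_distance(part, other) for j, other in enumerate(parts) if j != i]
        # A part with no neighbors at all is also reported as disconnected.
        if not gaps or min(gaps) > tol:
            loose.append(i)
    return loose


if __name__ == "__main__":
    seat = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    leg = [(0.0, 0.0, 0.0), (0.0, 0.0, -1.0)]   # touches the seat at the origin
    bar = [(5.0, 5.0, 5.0)]                     # floats away from everything
    print(disconnected_parts([seat, leg, bar]))
```

The brute-force pairwise search is quadratic in the number of points, which is acceptable for a diagnostic on small samples; a spatial index would be the natural replacement for dense point clouds.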
Geometric Reasoning Although our current method makes many design choices geared toward geometric reasoning about how parts fit together, we still see cases where the fit between parts is not perfect. For example, in Figure 5, we can see that the back pad does not fit perfectly into the back frame bar. This problem needs to be addressed in future work, where the method design should discover pairwise or triplet-level geometric properties that allow parts to fit together.


layer  configuration

UNet Encoding
1  Conv2D (3, 32, 3, 1, 1), ReLU, BN, Conv2D (32, 32, 3, 1, 1), ReLU, BN
2  Conv2D (32, 64, 3, 1, 1), ReLU, BN, Conv2D (64, 64, 3, 1, 1), ReLU, BN
3  Conv2D (64, 128, 3, 1, 1), ReLU, BN, Conv2D (128, 128, 3, 1, 1), ReLU, BN
4  Conv2D (128, 256, 3, 1, 1), ReLU, BN, Conv2D (256, 256, 3, 1, 1), ReLU, BN
5  Conv2D (256, 512, 3, 1, 1), ReLU, BN, Conv2D (512, 512, 3, 1, 1), ReLU, BN

UNet Decoding
1  ConvTranspose2D (1301, 256, 2, 2)
2  ConvTranspose2D (256, 128, 2, 2)
3  ConvTranspose2D (128, 64, 2, 2)
4  ConvTranspose2D (64, 32, 2, 2)
5  ConvTranspose2D (32, 1, 1, 1)

PointNet
1  Conv1D (3, 64, 1, 1), BN, ReLU
2  Conv1D (64, 64, 1, 1), BN, ReLU
3  Conv1D (64, 64, 1, 1), BN, ReLU
4  Conv1D (64, 128, 1, 1), BN, ReLU
5  Conv1D (128, 512, 1, 1), BN, ReLU

SLP 1
1  FC (512, 256), ReLU, MaxPool1D

SLP 2
1  FC (256, 256), ReLU
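As a reading aid for the UNet rows above, the sketch below traces spatial sizes through the listed layers using the standard output-size formulas for convolutions and transposed convolutions. It assumes the tuples read as (in channels, out channels, kernel, stride, padding) for Conv2D and (in channels, out channels, kernel, stride) for ConvTranspose2D, which the table does not state explicitly; the input sizes 224 and 14 are hypothetical.

```python
def conv2d_out(size, kernel, stride, padding=0):
    """Output length along one spatial axis for a standard 2D convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose2d_out(size, kernel, stride, padding=0):
    """Output length along one spatial axis for a transposed convolution
    (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# Assumed (kernel, stride, padding) of every encoder convolution in the table.
ENCODER_CONV = (3, 1, 1)
# Assumed (kernel, stride) of decoder rows 1 to 5 in the table.
DECODER_LAYERS = [(2, 2), (2, 2), (2, 2), (2, 2), (1, 1)]


def trace_decoder(size):
    """Return the spatial size after each decoder row, starting from `size`."""
    sizes = []
    for kernel, stride in DECODER_LAYERS:
        size = conv_transpose2d_out(size, kernel, stride)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    # A 3x3 convolution with stride 1 and padding 1 leaves the size unchanged.
    print(conv2d_out(224, *ENCODER_CONV))
    # Four stride-2 transposed convolutions upsample by 16x in total.
    print(trace_decoder(14))
```

Under these assumptions the encoder convolutions preserve resolution by themselves, so any downsampling between encoder levels would have to come from an operation not listed in the table.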



layer  configuration

SLP 3
1  FC (1301, 256), ReLU

Pose Decoder 1
1  FC (1301, 256), ReLU
2  FC (256, 3)
3  FC (256, 4)

SLP 4
1  FC (1031, 256), ReLU

Pose Decoder 2
1  FC (1031, 256), ReLU
2  FC (256, 3)
3  FC (256, 4)
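The two output heads of each pose decoder, with 3 and 4 outputs, are consistent with a 3D translation and a 4D rotation parameter. The sketch below shows how such a pose could be applied to a part's points, assuming the 4D output is a quaternion in (w, x, y, z) order that is normalized before use. Both the quaternion convention and the helper names are assumptions for illustration, not details confirmed by the table.

```python
import math


def normalize(q):
    """Scale a 4-vector to unit length so it represents a valid rotation."""
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def rotate(q, p):
    """Rotate point p by the unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    px, py, pz = p
    # Rows of the rotation matrix equivalent to the unit quaternion.
    return (
        (1 - 2 * (y * y + z * z)) * px + 2 * (x * y - w * z) * py + 2 * (x * z + w * y) * pz,
        2 * (x * y + w * z) * px + (1 - 2 * (x * x + z * z)) * py + 2 * (y * z - w * x) * pz,
        2 * (x * z - w * y) * px + 2 * (y * z + w * x) * py + (1 - 2 * (x * x + y * y)) * pz,
    )


def apply_pose(points, translation, quaternion):
    """Rotate every point by the (normalized) quaternion, then translate it."""
    q = normalize(quaternion)
    return [
        tuple(r + t for r, t in zip(rotate(q, p), translation))
        for p in points
    ]


if __name__ == "__main__":
    # Unnormalized quaternion for a 90 degree rotation about the z axis.
    posed = apply_pose([(1.0, 0.0, 0.0)], (0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 1.0))
    print(posed)
```

Normalizing inside `apply_pose` keeps the sketch robust to raw network outputs, which are not guaranteed to have unit length.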
