Autonomous assembly is a crucial capability for robots in many applications. For this task, several problems such as obstacle avoidance, motion planning, and actuator control have been extensively studied in robotics. However, when it comes to task specification, the space of possibilities remains underexplored. Towards this end, we introduce a novel problem, single-image-guided 3D part assembly, along with a learning-based solution. We study this problem in the setting of furniture assembly from a given complete set of parts and a single image depicting the entire assembled object. Multiple challenges exist in this setting, including handling ambiguity among parts (e.g., slats in a chair back and leg stretchers) and 3D pose prediction for parts and part subassemblies, whether visible or occluded. We address these issues by proposing a two-module pipeline that leverages strong 2D-3D correspondences and assembly-oriented graph message-passing to infer part relationships. In experiments with a PartNet-based synthetic benchmark, we demonstrate the effectiveness of our framework as compared with three baseline approaches.
The important and seemingly straightforward task of furniture assembly presents serious difficulties for autonomous robots. A general robotic assembly task consists of action sequences incorporating the following stages: (1) picking up a particular part, (2) moving it to a desired 6D pose, (3) mating it precisely with the other parts, and (4) returning the manipulator to a pose appropriate for the next pick-up movement. Solving such a complicated high-dimensional motion planning problem [25, 21] requires considerable time and engineering effort. Current robotic assembly solutions first determine the desired 6D pose of parts and then hard-code the motion trajectories for each specific object. Such limited generalizability and painstaking process planning fail to meet the demands of fast and flexible industrial manufacturing and household assembly tasks.
To generate smooth and collision-free motion planning and control solutions, the 6D poses of parts in 3D space must be predicted accurately [54, 27]. We propose a 3D part assembly task whose output can reduce the complexity of the high-dimensional motion planning problem. We aim to learn generalizable skills that allow robots to autonomously assemble unseen objects from parts. Instead of hand-crafting a fixed set of rules to assemble one specific chair, for example, we explore category-wise structural priors that help robots assemble all kinds of chairs. The shared part relationships across instances in a category not only suggest potential pose estimation solutions for unseen objects but may also help robotic control policies generalize [64, 53, 42, 60].
We introduce the task of single-image-guided 3D part assembly: predicting the 6D poses of parts in 3D space from a set of 3D parts and an image depicting the complete object. Robots can acquire geometry information for each part using 3D sensing, but the only information provided for the entire object shape is the instruction image. Unlike many structure-aware shape modeling works [40, 71, 17, 62, 70, 32, 52], we do not assume any specific granularity or semantics of the input parts, since the given furniture parts may not belong to any known part semantics and some of the parts may be provided pre-assembled into bigger units. We also step away from instruction manuals illustrating the step-by-step assembly process, as teaching machines to read sequential instructions depicted with natural language and figures is still a hard problem.
At the core of the task lie several challenges. First, some parts may have similar geometry. For example, distinguishing the geometric subtleties of chair leg bars, stretcher bars, and back bars is a difficult problem. Second, 3D geometric reasoning is essential in finding a joint global solution, where every piece fits perfectly in the puzzle. Parts follow a more rigid relationship graph, which determines a unique final solution that emerges from the interactions between the geometries of the parts. Third, the image grounds and selects one single solution from all possible part combinations that might all be valid for the generative task. Thus, the problem is at heart a reconstruction task where the final assembly needs to agree with the input image. Additionally, and unlike object localization tasks, the 3D part assembly task must locate all input parts, not only posing the parts visible in the image but also hallucinating poses for the invisible ones by leveraging learned data priors. One could provide multiple images to expose all parts to the robot, but this reduces the generalizability to real-world scenarios and might not be easy to achieve. Thus, we focus on solving the task of single-image and category-prior-guided pose prediction.
In this paper, we introduce a learning-based method to tackle the proposed single-image-guided 3D part assembly problem. Given the input image and a set of 3D parts, we first focus on 2D structural guidance by predicting a part-instance image segmentation to serve as a 2D-3D grounding for the downstream pose prediction. To enforce reasoning involving fine geometric subtleties, we design a context-aware 3D geometric feature to help the network reason about each part pose, conditioned on the existence of other parts, which might be of similar geometry. Building on the 2D structural guidance, we can generate a pose proposal for each visible part and leverage these predictions to help hallucinate poses for invisible parts as well. Specifically, we use a part graph network, whose edges encode different relationships among parts, and design a two-phase message-passing mechanism to take part relationship constraints into consideration in the assembly.
To the best of our knowledge, we are the first to assemble unlabeled 3D parts with a single image input. We set up a testbed for the problem on the recently released PartNet dataset. We pick three furniture categories with large shape variations that require part assembly: Chair, Table and Cabinet. We compare our method with several baseline methods to demonstrate the effectiveness of our approach. We follow the official PartNet train-test splits and evaluate all models on the unseen test shapes. Extensive ablation experiments also demonstrate the effectiveness and necessity of the proposed modules: the 2D-mask-grounding component and the 3D-message-passing reasoning component.
In summary, our contributions are:
we formulate the task of single-image-guided 3D part assembly;
we propose a two-module method, consisting of a part-instance image segmentation network and an assembly-aware part graph convolution network;
we compare with three baseline methods and conduct ablation studies demonstrating the effectiveness of our proposed method.
We review previous works on 3D pose estimation, single-image 3D reconstruction, as well as part-based shape modeling, and discuss how they relate to our task.
3D Pose Estimation. Estimating the pose of objects or object parts is a long-standing problem with a rich literature. As early as 2001, Langley et al. attempted to use visual sensors and neural networks to predict poses for robotic assembly tasks. Andy et al. built a robotic system that takes multi-view RGB-D images as input and predicts the 6D poses of objects for the Amazon Picking Challenge. Recently, Litvak et al. proposed a two-stage pose estimation procedure taking depth images as input. In the vision community, there is also a line of work studying instance-level object pose estimation for known instances [1, 48, 59, 28, 72, 58, 2] and category-level pose estimation [19, 44, 3, 63, 7] that can possibly deal with unseen objects from known categories. There are also works on object re-localization from scenes [76, 23, 61]. Different from these works, our task takes as input unseen parts without any semantic labels at test time, and requires certain part relationships and constraints to hold in order to assemble a plausible and physically stable 3D shape.
Single-Image 3D Reconstruction. Previous works reconstruct a 3D shape from a single image using voxel grids [10, 57, 67, 49], point clouds [15, 34, 22], meshes [65, 69], parametric surfaces, and implicit functions [8, 39, 45, 51, 74]. While one could employ such 2D-to-3D lifting techniques as a prior step in our assembly process, so that the given parts can be matched to the predicted 3D shape, this can misguide the assembly in multiple ways. For instance, the 3D prediction can be inaccurate, and even small geometric differences can be crucial for part pose prediction. Also, the occluded area can be hallucinated in different ways. In our case, the set of parts that should compose the object is given, and thus the poses of occluded parts can be more precisely specified. For these reasons, we do not leverage 3D shape generation techniques and instead directly predict the part poses from the input 2D image.
Part-Based Shape Modeling. 3D shapes have compositional part structures. Chaudhuri et al., Kalogerakis et al. and Jaiswal et al. introduced frameworks that learn probabilistic graphical models describing pairwise relationships of parts. Chaudhuri and Koltun, Sung et al. and Sung et al. predict the compatibility between a part and a partial object for sequential shape synthesis by parts. Dubrovina et al., PAGENet and CompoNet take a set of parts as input and generate the shape of the assembled parts. Different from these works, which usually assume known part semantics or a part database, our task takes a set of unseen parts at test time and we do not assume any provided part semantic labels.
GRASS, Im2Struct and StructureNet learn to generate box-abstracted hierarchical shape structures. SAGNet and SDM-Net learn the pairwise relationships among parts, which are subsequently integrated into a latent representation of the global shape. G2LGAN autoencodes the shape of an entire object with per-point part labels, and a subsequent network in the decoder refines the geometry of each part. PQ-Net represents a shape as a sequence of parts and generates one part at every step of the iterative decoding process. All of these works are relevant but differ from ours in that we obtain the final geometry of the object not by directly decoding a latent code into part geometry but by predicting the poses of the given parts and explicitly assembling them. There are also works studying partial-to-full shape matching [35, 36, 12]. Unlike these works, we use a single image as the guidance, instead of a 3D model.
We define the task of single-image-guided 3D part assembly: given a single RGB image $I$ depicting a 3D object $S$ and a set of $N$ 3D part point clouds $\mathcal{P} = \{p_1, p_2, \dots, p_N\}$, we predict a set of part poses $\{(R_i, t_i)\}_{i=1}^{N}$ in $SE(3)$ space. After applying the predicted rigid transformations to all the input parts $p_i$, their union reconstructs the 3D object $S$. We predict the output part poses in the camera space, following previous works [14, 66]. We represent rotations as quaternions and use $q_i$ and $R_i$ interchangeably.
We conduct a series of pose and scale normalizations on the input part point clouds to ensure synthetic-to-real generalizability. We normalize each part point cloud $p_i$ to have a zero-mean center and use a local part coordinate system computed using PCA. To normalize the global scale of all training and testing data, we compute Axis-Aligned Bounding Boxes (AABB) for all the parts and normalize them so that the longest box diagonal across all $p_i$ of a shape has unit length while preserving their relative scales. We cluster the normalized part point clouds $p_i$ into sets of geometrically equivalent part classes $\mathcal{C} = \{C_1, C_2, \dots, C_K\}$, where $C_1 = \{p_i\}_{i=1}^{N_1}$, $C_2 = \{p_i\}_{i=N_1+1}^{N_1+N_2}$, etc. For example, four legs of a chair are clustered together if their geometry is identical. This process of grouping indiscernible parts is essential to resolve the ambiguity among them in our framework. $\mathcal{C}$ is a disjoint complete set such that $C_k \cap C_l = \emptyset$ for every $C_k, C_l \in \mathcal{C}$ with $k \neq l$, and $\bigcup_{k=1}^{K} C_k$ is the full set of input parts. We denote the representative point cloud of each class $C_j \in \mathcal{C}$ by $p_j$.
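The centering and global scale normalization described above can be sketched as follows. This is a minimal illustration in plain Python, not the released implementation: the function names are hypothetical, point clouds are lists of `[x, y, z]` lists, and the PCA alignment of each part's local frame is deliberately omitted.

```python
import math

def center(points):
    """Translate a part point cloud so that its centroid is at the origin."""
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    return [[p[k] - c[k] for k in range(3)] for p in points]

def aabb_diagonal(points):
    """Length of the diagonal of the axis-aligned bounding box of a point cloud."""
    lo = [min(p[k] for p in points) for k in range(3)]
    hi = [max(p[k] for p in points) for k in range(3)]
    return math.sqrt(sum((hi[k] - lo[k]) ** 2 for k in range(3)))

def normalize_shape(parts):
    """Center every part, then divide all parts of one shape by the longest
    AABB diagonal so that relative part scales are preserved."""
    centered = [center(p) for p in parts]
    scale = max(aabb_diagonal(p) for p in centered)
    return [[[x / scale for x in pt] for pt in p] for p in centered]
```

After this step the largest part of a shape has a unit AABB diagonal, and every other part keeps its size relative to it.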
We propose a method for the task of single-image-guided 3D part assembly, which is composed of two network modules: the part-instance image segmentation module and the part pose prediction module; see Figure 2 for the overall architecture. We first extract a geometry feature of each part from the input point cloud and generate instance-level 2D segmentation masks on the input image. Conditioned on the predicted segmentation masks, our model then leverages both the 2D mask features and the 3D geometry features to propose 6D part poses. We explain these two network modules in the following subsections. See the supplementary material for implementation details.
To induce a faithful reconstruction of the object represented in the image, we need to learn a structural layout of the input parts from the 2D input. We predict a part instance mask $M_i$ for each of the $N$ input parts. All part masks are subject to the disjoint constraint, i.e., $M_{bg} + \sum_{i=1}^{N} M_i = \mathbf{1}$, where $M_{bg}$ denotes a background mask. If a part is invisible, we simply predict an empty mask and let the second network hallucinate a pose by leveraging contextual information and learned data priors. The difficulty of the task is twofold. First, the network needs to distinguish the geometric subtleties of the input part point clouds to establish a valid 2D-3D correspondence. Second, for the identical parts within each geometrically equivalent class, we need to identify separate 2D mask regions to pinpoint their exact locations. Below, we explain how our proposed method is designed to tackle these challenges.
Context-Aware 3D Part Features. To enable the network to reason about the delicate differences between parts, we construct a context-aware 3D conditional feature, which is computed from three components: a part geometry feature, an instance one-hot vector, and a global part contextual feature. We use PointNet to extract a global geometry feature for each part point cloud. If a part has multiple instances within a geometrically equivalent class (e.g., four chair legs), we introduce an additional instance one-hot vector to tell them apart. For a part that has only one instance, we use a one-hot vector whose first element is 1. For contextual awareness, we extract a global feature over all the input part point clouds, to help the network distinguish between similar but not equivalent part geometries (e.g., a short bar and a long bar). Precisely, we first compute the geometry feature and the instance one-hot vector for every part, then apply a Single-Layer Perceptron (SLP) to them to obtain a per-part local feature. We aggregate over all part local features via a max-pooling symmetric function to compute the global contextual feature. Finally, we combine these components into the context-aware 3D per-part feature.
Conditional U-Net Segmentation. We use a conditional U-Net for the part-instance segmentation task. Preserving the standard U-Net CNN architecture, our encoder takes a 3-channel RGB image as input and produces a bottleneck feature map. We concatenate the image feature with our context-aware 3D part conditional feature, duplicating the latter along the spatial dimensions to match the feature map, and obtain a conditional bottleneck feature. The decoder takes the conditional bottleneck feature and decodes a part mask for every input part. We keep skip links between encoder and decoder layers, as introduced in the original U-Net paper. To satisfy the non-overlapping constraint, we add a SoftMax layer across all predicted masks, augmented with a background mask.
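To illustrate how a SoftMax across the predicted masks plus a background mask yields non-overlapping soft masks, here is a minimal sketch in plain Python. The function name and the nested-list representation of the logit maps are assumptions made for illustration; a real implementation would apply the same operation to tensors.

```python
import math

def softmax_masks(part_logits, background_logits):
    """part_logits: list of N logit maps (each H x W, nested lists).
    background_logits: one H x W logit map.
    Returns N + 1 probability maps (background last) that sum to 1 per pixel."""
    maps = part_logits + [background_logits]
    h, w = len(background_logits), len(background_logits[0])
    out = [[[0.0] * w for _ in range(h)] for _ in maps]
    for y in range(h):
        for x in range(w):
            vals = [m[y][x] for m in maps]
            mx = max(vals)  # subtract the max for numerical stability
            exps = [math.exp(v - mx) for v in vals]
            z = sum(exps)
            for i, e in enumerate(exps):
                out[i][y][x] = e / z
    return out
```

Because the probabilities at every pixel sum to one, each pixel is softly assigned to one part or to the background, which is the disjointness the segmentation module needs.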
With the 2D grounding masks produced by the part-instance image segmentation module, we predict a 6D part pose for every input part using the part pose prediction module. We predict a unit Quaternion vector that corresponds to a 3D rotation and a translation vector denoting the part center position in the camera space.
Different from object pose estimation, the task of part assembly requires a joint prediction of all part poses. Part pose predictions should not be independent of each other, as part poses follow a set of more rigid relationships, such as symmetry and parallelism. For a valid assembly, parts must be in contact with adjacent parts. The rich part relationships restrict the solution space for each part pose. We leverage a two-phase graph convolutional neural network to let part poses communicate jointly for the task of part assembly.
Mask-Conditioned Part Features. We consider three sources of features for each part: a 2D image feature, a 2D mask feature, and the context-aware 3D part feature. We use a ResNet-18 pretrained on ImageNet to extract the 2D image feature. We use a separate ResNet-18 that takes the 1-channel binary mask as input and extracts the 2D mask feature, where masks for invisible parts are predicted as empty. Finally, we reuse the context-aware 3D part feature introduced in Sec. 4.1, which encodes 3D part geometry information along with its global context.
Two-Phase Graph Convolution. We create a part graph, treating every part as a node, and propose two phases of graph convolution to predict the pose of each part. We first describe how we construct the edges in each phase, and then introduce our assembly-oriented graph convolution operations.
During the first phase, we draw pairwise edges among all parts in every geometrically equivalent part class and perform graph convolution over this edge set, which connects each pair of distinct parts belonging to the same class.
These first-phase edges allow message passing among geometrically identical parts that are likely to have certain spatial relationships or constraints (e.g., four legs of a chair have two orthogonal reflection planes). After the first-phase graph convolution, each node has an updated node feature. The updated node feature is then decoded as a 6D pose for each part. The predicted part poses produce an initial assembled shape.
We leverage a second phase of graph convolution to refine the predicted part poses. In addition to the first-phase edges, we draw a new set of edges by finding the top-5 nearest neighbors of each part based on the initial assembly, and use both edge sets together in the second phase. The intuition here is that once we have an initial part assembly, we are able to connect the adjacent parts so that they learn to attach to each other with certain joint constraints.
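The two edge sets can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and measuring nearness by the distance between part centers in the initial assembly is an assumption, since the text does not state which distance is used for the nearest-neighbor search.

```python
import math
from itertools import permutations

def class_edges(classes):
    """First-phase edges: every ordered pair of distinct parts in the same
    geometrically equivalent class. `classes` is a list of lists of part indices."""
    return {(i, j) for members in classes for i, j in permutations(members, 2)}

def knn_edges(centers, k=5):
    """Second-phase edges: from each part to its k nearest parts, using the
    distance between part centers (an assumed proxy for part proximity)."""
    edges = set()
    for i, ci in enumerate(centers):
        others = sorted((math.dist(ci, cj), j)
                        for j, cj in enumerate(centers) if j != i)
        edges.update((i, j) for _, j in others[:k])
    return edges
```

For example, `class_edges([[0, 1], [2]]) | knn_edges(centers, k=5)` gives the combined edge set on which the refinement phase would operate.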
We implement the graph convolution as two iterations of message passing [73, 68, 40]. Given a part graph with initial node features and edge attributes, each iteration of message passing starts by computing an edge feature for every edge. We do not use the edge attribute during the first phase of graph convolution; in the second phase, it is a binary indicator that distinguishes the first-phase edges from the nearest-neighbor edges. Then, we perform average pooling over all edge features that are connected to a node and obtain the updated node feature. We define a default updated feature for any node that has no edge drawn from it. The final node features are defined in the same way for each phase of graph convolution, and we denote the final node features of the first and second phases separately for each part.
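One iteration scheme consistent with the description above can be sketched in plain Python. Several details here are assumptions rather than the paper's definition: `edge_fn` stands in for the learned edge network, nodes without outgoing edges receive a zero vector, and the function returns only the features after the last iteration.

```python
def message_passing(node_feats, edges, edge_fn, iterations=2):
    """node_feats: list of feature vectors (lists of floats), one per part.
    edges: iterable of directed pairs (i, j).
    edge_fn(v_i, v_j): returns an edge feature (list of floats); hypothetical
    stand-in for a learned network."""
    feats = [list(v) for v in node_feats]
    for _ in range(iterations):
        msgs = {i: [] for i in range(len(feats))}
        for i, j in edges:
            msgs[i].append(edge_fn(feats[i], feats[j]))
        new_feats = []
        for i, v in enumerate(feats):
            if msgs[i]:
                dim = len(msgs[i][0])
                # average pooling over all edge features drawn from node i
                new_feats.append([sum(m[d] for m in msgs[i]) / len(msgs[i])
                                  for d in range(dim)])
            else:
                new_feats.append([0.0] * len(v))  # assumed default: zero feature
        feats = new_feats
    return feats
```

With a toy `edge_fn` that averages its two inputs, two connected nodes converge to a shared feature while an isolated node falls back to the default, which shows how information is exchanged only along the drawn edges.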
Part Pose Decoding. After the two-phase graph convolution produces the final node features of both phases, we use a Multi-Layer Perceptron (MLP) to decode part poses at each phase.
To ensure that the predicted quaternion $q$ has unit length, we normalize the output vector so that $\|q\|_2 = 1$.
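As a small worked example of this normalization, the sketch below rescales a raw 4-vector to a unit quaternion and converts it to a rotation matrix. The `(w, x, y, z)` component order and the function names are assumptions for illustration; the source does not specify a convention.

```python
import math

def normalize_quaternion(q):
    """Rescale a raw 4-vector so that it has unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return [c / n for c in q]

def quaternion_to_matrix(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]
```

For instance, the raw output `[1, 0, 0, 1]` normalizes to a 90-degree rotation about the z-axis, which maps the x-axis onto the y-axis. A practical implementation would also guard against a near-zero norm.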
We first train the part-instance image segmentation module until its convergence and then train the part pose prediction module. Empirically, we find that having a good mask prediction is necessary before training for the part poses.
Loss for Part-Instance Image Segmentation. We adapt the negative soft-IoU loss to supervise the training of the part-instance image segmentation module. We perform Hungarian matching within each geometrically equivalent class to guarantee that the loss is invariant to the order of parts in the ground truth and the prediction.
The loss is computed between each ground-truth mask and its matched predicted mask, where the matching maps ground-truth part indices to predicted ones and the sums run over all 2D pixel indices on the image plane.
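The order-invariant mask loss can be sketched as follows. Two assumptions are worth flagging: the soft-IoU form used here (intersection `p*g`, union `p + g - p*g`) is the common definition and may differ in detail from the cited one, and the matching is done by brute force over permutations instead of the Hungarian algorithm, which is only practical for the small class sizes in this setting. All function names are hypothetical.

```python
from itertools import permutations

def soft_iou(pred, gt):
    """Soft IoU between a predicted probability mask and a binary ground-truth
    mask, both given as H x W nested lists."""
    inter, union = 0.0, 0.0
    for prow, grow in zip(pred, gt):
        for p, g in zip(prow, grow):
            inter += p * g
            union += p + g - p * g
    return inter / union if union > 0 else 0.0

def match_within_class(pred_masks, gt_masks):
    """Return perm such that ground-truth mask k is matched to predicted mask
    perm[k], maximizing the total soft IoU (brute-force stand-in for Hungarian)."""
    n = len(gt_masks)
    best = max(permutations(range(n)),
               key=lambda perm: sum(soft_iou(pred_masks[perm[k]], gt_masks[k])
                                    for k in range(n)))
    return list(best)

def segmentation_loss(pred_masks, gt_masks):
    """Negative soft IoU summed over matched mask pairs of one class."""
    perm = match_within_class(pred_masks, gt_masks)
    return -sum(soft_iou(pred_masks[perm[k]], gt_masks[k])
                for k in range(len(gt_masks)))
```

Because the loss is evaluated after matching, swapping the predictions of two identical chair legs does not change its value.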
Losses for Part Pose Prediction. For the pose prediction module, we design an order-invariant loss by conducting Hungarian matching within each geometrically equivalent class. Additionally, we observe that supervising translation and rotation with separate losses helps stabilize training. We use the following training losses for the pose prediction module.
We use the Euclidean distance to measure the difference between the predicted 3D translation and the ground-truth translation of each part.
The distance is computed between the matched predicted translation and the ground-truth 3D translation, using the matching results described above. This term has its own weight parameter in training.
We use two losses for rotation prediction: Chamfer distance and L2 distance. Because many parts have symmetric geometry (e.g., bars and boards), which results in multiple valid rotation solutions, we use Chamfer distance as the primary supervising loss to address such pose ambiguity. Given the point cloud of a part, the ground-truth rotation, and the matched predicted rotation, the Chamfer distance loss is computed
between the part point clouds rotated by the predicted rotation and by the ground-truth rotation, respectively, and has its own weight in training. Some parts are not perfectly symmetric (e.g., a bar whose two ends have small but noticeable geometric differences), and using Chamfer distance by itself in this case would make the network fall into local minima. We encourage the network to correct this situation by penalizing the Euclidean distance between the matched predicted rotated point cloud and the ground-truth rotated point cloud.
This L2 loss uses the Frobenius norm of the difference between the two rotated point clouds, normalized by the number of points per part. Note that the L2 loss on its own is not sufficient when the parts are completely symmetric. Thus, we add it as a regularizing term with a smaller weight. We conducted an ablation experiment demonstrating that the L2 loss contributes to correcting the rotation of some parts.
Finally, we compute a holistic shape Chamfer distance, since the predicted assembly should be close to the ground-truth shape.
It is computed between the predicted assembled shape point cloud and the ground-truth shape point cloud. This loss encourages the holistic shape appearance and the part relationships to be close to the ground truth. This term also has its own weight in training.
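The difference between the Chamfer-based and L2-based rotation losses is easiest to see on a symmetric part. The sketch below uses one common form of the symmetric Chamfer distance (mean squared nearest-neighbor distance in both directions) and a per-point mean squared distance; the exact normalization and weights used in training are not reproduced here, and the function names are hypothetical.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point clouds (lists of 3D points):
    mean squared nearest-neighbor distance from a to b plus from b to a."""
    def one_way(src, dst):
        return sum(min(sum((s[k] - d[k]) ** 2 for k in range(3)) for d in dst)
                   for s in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def l2_distance(a, b):
    """Mean squared distance between corresponding points of two clouds."""
    return sum(sum((p[k] - q[k]) ** 2 for k in range(3))
               for p, q in zip(a, b)) / len(a)

# A two-point "bar" and the same bar rotated by 180 degrees about the z-axis.
bar = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
flipped = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
print(chamfer_distance(bar, flipped))  # 0.0: the flip is not penalized
print(l2_distance(bar, flipped))       # 4.0: point correspondence is penalized
```

The Chamfer term accepts any rotation that maps a symmetric part onto itself, while the L2 term distinguishes the two ends, which is why it is useful as a lightly weighted regularizer for nearly symmetric parts. The same `chamfer_distance` applies unchanged to whole assembled shapes.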
In this section, we set up the testbed for the proposed single-image-guided 3D part assembly problem on the PartNet dataset. To validate the proposed approach, we compare against three baseline methods. Both qualitative and quantitative results demonstrate the effectiveness of our method.
Recently, Mo et al. proposed the PartNet dataset, which is the largest 3D object dataset with fine-grained and hierarchical part annotations. Every PartNet object is provided with a ground-truth hierarchical part instance-level semantic segmentation, from coarse to fine-grained levels, which provides a good range of part complexity. In our work, we use the three largest furniture categories that require real-world assembly: Chair, Table and Cabinet. We follow the official PartNet train/validation/test split and filter out the shapes with more than 20 parts.
For each object category, we create two data modalities: Level-3 and Level-mixed. Level-3 corresponds to the most fine-grained PartNet segmentation. While we do not assume known part semantics, an algorithm can implicitly learn semantic priors when dealing with the Level-3 data, which is undesirable given our goal of generalizing to real-life assembly settings, as it is unrealistic to assume that IKEA furniture also follows the same semantics as PartNet. To force the network to reason with part geometries, we create an additional modality, Level-mixed, which contains part segmentations at all levels in the PartNet hierarchy. Specifically, for each shape, we traverse every path of the ground-truth part hierarchy and stop at a random level. We have 3736 chairs, 2431 tables, 704 cabinets in Level-3 and 4664 chairs, 5987 tables, 888 cabinets in Level-mixed.
For the input image, we render a set of images of the PartNet models with ShapeNet textures. We then compute the world-to-camera matrix accordingly and obtain the ground-truth 3D object position in the camera space, which is used to supervise the part-instance segmentation. For the input point cloud, we use Furthest Point Sampling (FPS) to sample points over each part mesh. We then normalize them following the description in Sec. 3. After the parts are normalized, we detect geometrically equivalent classes of parts by first filtering candidate parts by comparing the dimensions of their AABBs under a threshold of 0.1. We then process the remaining candidates by computing all pairwise part Chamfer distances, normalized by their average diagonal length, under a hand-picked threshold of 0.02.
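Furthest Point Sampling itself is a standard greedy procedure, sketched below in plain Python. The sketch assumes a dense set of candidate points has already been drawn from the part mesh (that step is not shown), and the function name and `start` parameter are hypothetical.

```python
import math

def farthest_point_sampling(points, m, start=0):
    """Greedily pick m point indices (m <= len(points)) so that each new point
    is the one farthest from all points chosen so far."""
    chosen = [start]
    # dist[i] = distance from point i to its nearest already-chosen point
    dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < m:
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen
```

Compared with uniform random sampling, this spreads the samples evenly over the part, which helps thin structures such as bars remain well covered.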
To evaluate the part assembly performance, we use two metrics: part accuracy and shape Chamfer distance. The object pose estimation community usually uses metrics such as 5-degree-5-cm. However, fine-grained part segments usually show abundant pose ambiguity. For example, a chair leg may simply be a cylinder, which has full rotational and reflective symmetry. Thus, we introduce the part accuracy metric, which leverages the Chamfer distance between the part point clouds after applying the predicted part pose and the ground-truth pose to address such ambiguity. Following the notation defined in Section 4.3, we define the Part Accuracy Score (PA) as the fraction of parts whose Chamfer distance falls below a fixed threshold.
Borrowing the evaluation metric heavily used in the 3D object reconstruction community, we also measure the shape Chamfer distance (SC) from the predicted assembled shape to the ground-truth assembly, again using the notation defined in Section 4.3.
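A minimal sketch of the part accuracy metric is given below. It assumes the predicted and ground-truth poses have already been applied to each part's point cloud, uses one common form of the symmetric Chamfer distance, and leaves the threshold as a parameter `tau` because its value is not recoverable from the text; all names are hypothetical.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean squared nearest-neighbor distance
    from a to b plus from b to a."""
    def one_way(src, dst):
        return sum(min(sum((s[k] - d[k]) ** 2 for k in range(3)) for d in dst)
                   for s in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def part_accuracy(pred_parts, gt_parts, tau):
    """Fraction of parts whose posed prediction lies within Chamfer distance
    tau of the posed ground truth. pred_parts[i] and gt_parts[i] are the same
    part under the predicted and the ground-truth pose, respectively."""
    hits = sum(1 for p, g in zip(pred_parts, gt_parts)
               if chamfer_distance(p, g) < tau)
    return hits / len(gt_parts)
```

Because the comparison uses Chamfer distance rather than pose error, a symmetric part placed in any of its equivalent orientations still counts as correct.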
We compare our approach to three baseline methods. Since no previous work addresses exactly the same task, we adapt previous works on part-based shape generative modeling [70, 55, 40, 43] to our setting and compare with them. Most of these works require known part semantics and thus perform part-aware shape generation without input part conditions. However, in our task, there is no assumption on part semantics or part priors, and thus all methods must explicitly take the input part point clouds as conditions. We train all three baselines with the same pose loss used in our method, defined in Section 4.3.
Sequential Pose Proposal (B-GRU) The first baseline is a sequential model, similar to the methods proposed in [70, 55]. Instead of sequentially generating parts, we sequentially decode candidate poses for a given part geometry, conditioned on an image. For each input part, if there are n geometrically equivalent parts, we take the first n pose proposals generated by the GRU and conduct Hungarian matching to match them with the ground-truth part poses.
|Modality||Method||Part Accuracy||Assembly CD|
Instance One-hot Pose Proposal (B-InsOneHot) The second baseline uses an MLP to directly infer the pose of a given part from its geometry and the input image, similar to previous works [40, 43] that output box abstractions for shapes. Here, instead of predicting a box for each part, we predict a 6D part pose. We use instance one-hot features to differentiate between the equivalent part point clouds, and conduct Hungarian matching to match with the ground-truth part poses regardless of the one-hot encoding.
Global Feature Model (B-Global) The third baseline improves upon the second baseline by adding our context-aware 3D part feature defined in Section 4.1. Each part pose proposal considers not only the part-specific 3D feature and the 2D image feature, but also a global 3D feature, obtained by max-pooling over all 3D part features, that contains information about all parts. This baseline shares similar ideas with PAGENet and CompoNet, which also compute global features to assemble each of the generated parts.
|Modality||Method||Part Accuracy (Visible)||Part Accuracy (Invisible)|
We compare with the three baselines and observe that our method outperforms the baseline methods both qualitatively and quantitatively using the two evaluation metrics, PA and SC. We show significant improvement for occluded part pose hallucination, as Table 2 demonstrates. Qualitatively, we observe that our method can learn to infer part poses for invisible parts by (1) learning a category prior and (2) leveraging visible parts of the same geometrically equivalent class. Our network can reason about the stacked placement structure of cabinets, as shown in the last row of Fig 3. The input image does not reveal the inner structure of the cabinet, and our proposed approach learns to vertically distribute the geometrically equivalent boards to fit inside the cabinet walls, similar to the ground-truth shape instance. The top row of Fig 3 demonstrates how our network learns to place the occluded back bar along the visible ones. This can be attributed to our first phase of graph convolution, where we leverage visible parts to infer the poses of occluded parts in the same geometrically equivalent class.
Our method demonstrates the most faithful part pose prediction for the shape instance depicted by the input image. As shown in Fig 3 row (e), our method spaces the board parts equally along the vertical direction, which is consistent with the shape structure revealed by the input image. This likely results from our part-instance image segmentation module, where we explicitly predict a 2D-3D grounding, whereas the baseline methods lack such a component. We further demonstrate its effectiveness with an ablation experiment.
However, our proposed method has limitations in dealing with unusual image views, exotic shape instances, and shapes composed of only one type of part geometry, all of which result in noisy mask predictions. The 2D-3D grounding error cascades to later network modules, resulting in poor pose predictions. As shown in Fig 4 row (a), the image view is not very informative of the shape structure, making it difficult to leverage 3D geometric cues to find a 2D-3D grounding. Additionally, this chair instance is itself atypical of the Chair category. We avoided employing differentiable rendering because it does not help address such failure cases. Fig 4 row (b) shows a case where a shape instance is composed of a single type of part geometry. The geometric similarity of the board parts makes it difficult for the network to reach a determinate answer for the segmentation prediction, resulting in a sub-optimal part pose prediction. These obstacles arise from the task itself, and all baselines suffer from the same difficulties.
Ablation Experiments We conduct several ablation experiments on our proposed method and losses, trained on PartNet Chair Level-3. Table 3 in the Appendix demonstrates the effectiveness of each ablated component. The part-instance image segmentation module plays the most important role in our pipeline. Removing it results in the most significant performance decrease.
We formulated a novel problem of single-image-guided 3D part assembly and proposed a neural-network-based pipeline for the task that leverages information from both 2D grounding and 3D geometric reasoning. We established a testbed on the PartNet dataset. Quantitative evaluation demonstrates that the proposed method achieves a significant improvement over three baseline methods. In future work, one can study how to leverage multiple images or 3D partial scans as inputs to achieve better results. We also do not explicitly consider the connecting junctions between parts (e.g., pegs and holes) in our framework, which are strong constraints for real-world robotic assembly.
We thank the Vannevar Bush Faculty Fellowship and the grants from the Samsung GRO program and the SAIL Toyota Research Center for supporting the authors with their research, but this article reflects only the opinions and conclusions of its authors. We also thank Autodesk and Adobe for the research gifts provided.
This document provides supplementary materials accompanying the main paper, including
Discussion of failure cases and future works;
More Architecture Details;
More Qualitative Examples.
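Table 3. Ablation experiments on PartNet Chair Level-3.
|Method||Part Accuracy||Part Accuracy (Visible)||Part Accuracy (Invisible)||Assembly CD|
|w/o L2 Rotation loss|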
|w/o Graph Conv 1, 2||0.403||0.423||0.178||0.073|
|w/o Graph Conv 2||0.434||0.456||0.239||0.073|
|w/o Image Feature||0.403||0.419||0.208||0.077|
|w/o Global Feature||0.418||0.437||0.202||0.072|
|Ours - Full||0.454||0.470||0.270||0.067|
Disconnected Parts We notice that our predictions on very fine-grained instances sometimes result in unconnected parts. The assembly setting imposes the physical constraint that each part must be in contact with another part. However, the implicit soft constraint enforced by the second phase of graph convolution is not sufficient for this task. Ideally, the translation and rotation predicted for each part are only valid if they transform the part to be in contact at the joints between relevant parts. For example, in Figure 5 we can see that the back of the chair base bars does not connect. We plan to address this problem in future work by explicitly enforcing contact between parts within a contact neighborhood.
Geometric Reasoning Additionally, although our proposed method makes many design choices geared towards geometric reasoning about how parts fit together, we still see some cases where the fit between parts is not yet perfect. For example, in Figure 5, we can see that the back pad does not fit perfectly into the back frame bar. This problem needs to be addressed in future work, where the method design should discover pairwise or triplet-level geometric properties that allow parts to fit together.
|1||Conv2D (3, 32, 3, 1, 1), ReLU, BN,|
|Conv2D (32, 32, 3, 1, 1), ReLU, BN,|
|2||Conv2D (32, 64, 3, 1, 1), ReLU, BN,|
|Conv2D (64, 64, 3, 1, 1), ReLU, BN,|
|3||Conv2D (64, 128, 3, 1, 1), ReLU, BN,|
|Conv2D (128, 128, 3, 1, 1), ReLU, BN,|
|4||Conv2D (128, 256, 3, 1, 1), ReLU, BN,|
|Conv2D (256, 256, 3, 1, 1), ReLU, BN,|
|5||Conv2D (256, 512, 3, 1, 1), ReLU, BN,|
|Conv2D (512, 512, 3, 1, 1), ReLU, BN,|
|1||ConvTranspose2D(1301, 256, 2, 2)|
|2||ConvTranspose2D(256, 128, 2, 2)|
|3||ConvTranspose2D(128, 64 , 2, 2)|
|4||ConvTranspose2D(64, 32, 2, 2)|
|5||ConvTranspose2D(32, 1, 1, 1)|
|1||Conv1D (3, 64, 1, 1), BN, ReLU|
|2||Conv1D (64, 64, 1, 1), BN, ReLU|
|3||Conv1D (64, 64, 1, 1), BN, ReLU|
|4||Conv1D (64, 128, 1, 1), BN, ReLU|
|5||Conv1D (128, 512, 1, 1), BN, ReLU|
|1||FC (512, 256), ReLU, MaxPool1D|
|1||FC (256, 256), ReLU|
|1||FC(1301, 256), ReLU|
|Pose Decoder 2|
|1||FC (1301, 256), ReLU|
|1||FC(1031, 256), ReLU|
|Pose Decoder 2|
|1||FC (1031, 256), ReLU|
European Conference on Computer Vision, pp. 536–551.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3364–3372.
A genetic algorithm for robotic assembly line balancing. European Journal of Operational Research 168 (3), pp. 811–825.
PointNet: deep learning on point sets for 3D classification and segmentation. In CVPR.