Category-Level Articulated Object Pose Estimation

12/26/2019 ∙ by Xiaolong Li, et al. ∙ 7

This paper addresses the task of category-level pose estimation for articulated objects from a single depth image. We present a novel category-level approach that correctly accommodates object instances not previously seen during training. A key aspect of the work is the new Articulation-Aware Normalized Coordinate Space Hierarchy (A-NCSH), which represents the different articulated objects for a given object category. This approach not only provides the canonical representation of each rigid part, but also normalizes the joint parameters and joint states. We developed a deep network based on PointNet++ that is capable of predicting an A-NCSH representation for unseen object instances from single depth input. The predicted A-NCSH representation is then used for global pose optimization using kinematic constraints. We demonstrate that constraints associated with joints in the kinematic chain lead to improved performance in estimating pose and relative scale for each part of the object. We also demonstrate that the approach can tolerate cases of severe occlusion in the observed data. Project webpage



Code Repositories


[CVPR 2020, Oral] Category-Level Articulated Object Pose Estimation


1 Introduction

Figure 1: Category-level articulated object pose estimation. Given a single depth image of a novel articulated object from a known category, the goal of the algorithm is to estimate detailed per-part poses, segmentation, amodal bounding boxes, as well as joint parameters and joint states of the object.

Our environment is populated with articulated objects, ranging from furniture such as cabinets or ovens to small tabletop objects such as laptops or eyeglasses. Effectively interacting with these objects requires a detailed understanding of their articulation states and part-level poses. Such understanding is beyond the scope of typical 6D pose estimation algorithms, which have been designed for rigid objects [xiang2017posecnn, tremblay2018deep, sundermeyer2018implicit, wang2019normalized]. Algorithms that do consider object articulations [katz2008manipulating, katz2013interactive, hausman2015active, martin2016integrated] often require the exact object CAD model and the associated joint parameters at test time, preventing them from generalizing to new object instances.

In this paper, we focus on the task of category-level pose estimation for articulated objects from a single depth image – a task that aims at producing the detailed per-part pose and scale, joint parameters, and joint states of a novel articulated object instance from a known category. An overview is shown in Figure 1. To achieve this goal, several major challenges need to be addressed:

First, to handle a new object instance without a precise 3D CAD model, the algorithm needs to define a shared object representation that can accommodate different object instances within a given category. The representation needs to be able to generalize to different variations of part geometry, joint parameters, and self-occlusion patterns.

Second, in contrast to rigid objects, the pose of an articulated object has many more degrees of freedom. Depending on the length of the kinematic chain, the number of degrees of freedom in the target pose representation varies significantly. It becomes a challenging problem to accurately estimate pose in such a high-dimensional space while being faithful to physical constraints.

Third, various joint types present different articulation characteristics along with different physical constraints and priors. We are particularly interested in the two most common joint types, revolute joints that cause rotational motion (e.g., door hinges), and prismatic joints that allow translational movement (e.g., drawers in a cabinet). Designing a framework that can consider and leverage both joint types effectively is still an open research problem.

To address the first representation challenge, we propose a shared category-level representation for different articulated object instances, which we call Articulation-Aware Normalized Coordinate Space Hierarchy (A-NCSH). Concretely, A-NCSH represents different articulated objects in a “canonical” space, which normalizes part shapes, joint parameters, and joint states. The pose of each rigid part and the state of each joint in the depth image can then be defined with respect to their normalized states in A-NCSH. This representation provides us a way to define category-level pose even for unseen object instances regardless of intra-class variance.

To address the second pose estimation challenge, which involves high dimensionality, we segment objects into rigid parts and estimate the pose on a per-part basis. However, separate per-part pose estimation could easily lead to physically impossible solutions since joint constraints are not considered and the estimation may not conform with the actual degrees of freedom. To cope with this issue, we treat joints as “first-class citizens.” Our approach estimates joint parameters, and leverages the induced kinematic priors to constrain the pose of each part. We formulate articulated pose fitting from the A-NCSH space as a combined optimization problem, taking both rigid part pose fitting and joint constraints into consideration.

To handle both revolute and prismatic joints, each joint is associated with a unique index, and our system explicitly predicts the joint index and the corresponding joint parameters. We model the constraints introduced by each type of joint mathematically with the predicted parameters, and selectively use these constraints during the pose fitting stage according to the associated joint type. Although we focus only on revolute and prismatic joints in our work, the approach is sufficiently general to support other types of joints.

In summary, the primary contribution of our paper is the formulation of a unified framework for the task of category-level articulated pose estimation that is able to handle previously unseen object instances and multiple articulation types. In support of this framework, we designed:

  • A new category-level representation for articulated objects – Articulation-Aware Normalized Coordinate Space Hierarchy (A-NCSH).

  • A PointNet++ based neural network that is capable of predicting A-NCSH representations for unseen articulated object instances from single depth input.

  • A strategy for global optimization that leverages kinematic constraints along with all the information in the A-NCSH representation to improve the overall pose estimation accuracy.

Our experiments demonstrate that the A-NCSH representation and the global optimization using A-NCSH prediction lead to improved performance in both part pose prediction and joint parameter estimation.

2 Related Work

This section summarizes related work on pose estimation for rigid and articulated objects.

Rigid object pose estimation.

Classically, the goal of pose estimation is to infer an object’s 6D pose (3D rotation and 3D location) relative to a given reference frame. Most previous work has focused on estimating instance-level pose by assuming that exact 3D CAD models are available. For example, traditional algorithms such as iterative closest point (ICP) [besl1992method] perform template matching by aligning the CAD model with an observed 3D point cloud. Another family of approaches aims to regress the object coordinates onto its CAD model for each observed object pixel, and then uses voting to solve for object pose [brachmann2014learning, brachmann2016uncertainty]. These approaches are limited by the need to have exact CAD models for particular object instances.

Category-level pose estimation aims to infer an object’s pose and scale relative to a category-specific canonical representation. Recently, Wang et al. [wang2019normalized] extended the object coordinate based approach to perform category-level pose estimation. The key idea behind the intra-category generalization is to regress the coordinates within a Normalized Object Coordinate Space (NOCS), where the sizes are normalized and the orientations are aligned for objects in a given category. Whereas the work by [wang2019normalized] focuses on pose and size estimation for rigid objects, the work presented here extends the NOCS concept to accommodate articulated objects at both part and object level. In addition to pose, our work also infers joint information and addresses particular problems related to occlusion.

Figure 2: Articulation-Aware Normalized Coordinate Space Hierarchy (A-NCSH) is a shared object representation for different object instances in a given category. It consists of a two-level hierarchy: at the leaf level, it uses Normalized Part Coordinate Space (NPCS) to represent each individual part; at the root level, it uses Normalized Articulated Object Coordinate Space (NAOCS), which is a single coordinate space that transforms all the NPCS-based parts to represent a complete articulated object in a pre-defined rest state. Here we show two examples of A-NCSH representation, where each point is colored according to its coordinate location in the corresponding representation (NAOCS or NPCS).

Articulated object pose estimation.

Most algorithms that attempt pose estimation for articulated objects assume that instance-level information is available. The approaches often use CAD models for particular instances along with known kinematic parameters to constrain the search space and to recover the pose separately for different parts [michel2015pose, desingh2018factored]. Michel et al. [michel2015pose] use a random forest to vote for pose parameters on canonical body parts for each point in a depth image, followed by a variant of the Kabsch algorithm to estimate joint parameters using RANSAC-based energy minimization. Desingh et al. [desingh2018factored] adopted a generative approach using a Markov Random Field formulation, factoring the state as individual parts constrained by their articulation parameters. However, these approaches only consider known object instances and cannot handle different part and kinematic variations.

Another line of work relies on active manipulation of an object to infer its articulation pattern [katz2008manipulating, katz2013interactive, hausman2015active, martin2016integrated, yi2018deep]. For example, Katz et al. [katz2013interactive] use a robot manipulator to interact with articulated objects as RGB-D videos are recorded. The 3D points are then clustered into rigid parts according to their motion. Although these approaches could perform pose estimation for unknown objects, they require the input to be a sequence of images that observe an object’s different articulation states, whereas our approach is able to perform the task using a single depth observation.

Human body and hand pose estimation.

Two specific articulated classes have gained considerable attention recently: the human body and the human hand. For human pose estimation, approaches have been developed using end-to-end networks to predict 3D joint locations directly [mehta2017vnect, sun2017compositional, pavlakos2017coarse], using dense correspondence maps between 2D images and 3D surface models [alp2018densepose], or estimating full 3D shape through 2D supervision [lassner2017unite, pavlakos2018learning]. Techniques for hand pose estimation (e.g., [wan2018dense, ge2018point]) often start with per-pixel estimates, such as pixel-level segmentation, coordinate regression or joint voting. The pixel-level prediction is then aggregated to infer 3D joint coordinates. Approaches for both body and hand pose estimation are often specifically customized for those object types, relying on a fixed skeletal model with class-dependent variability (e.g., expected joint lengths) and strong shape priors (e.g., using parametric body shape model for low-dimensional parameterization). Also, such hand/body approaches accommodate only revolute joints and do not generalize well to other object types. In contrast, our algorithm is able to handle general articulated objects with any kinematic chain topology, allowing both revolute joints and prismatic joints.

3 Problem Statement

The input to the system is a 3D point cloud representing an unknown object instance from a known category. The goal is to segment the point cloud into rigid moving parts, recover the 3D rotation, 3D translation, and size of each part, and predict the joints with their parameters and states. We consider two types of joints in this work: 1D revolute joints (e.g., door hinges) and 1D prismatic joints (e.g., drawers in a cabinet). For a revolute joint, the joint parameters include the direction of the rotation axis as well as a pivot point on the rotation axis. The state is defined as the relative rotation angle between the two connected parts, as compared with a pre-defined rest state. For a prismatic joint, the joint parameters are simply the direction of the translation axis, and the joint state is defined as the relative translation distance between the two connected parts compared with a pre-defined rest state. This representation scheme not only encodes many common articulated objects effectively and compactly, but also suggests possible articulations.
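The joint parameterization above can be illustrated with a minimal Python sketch. The class names `RevoluteJoint` and `PrismaticJoint` and the method `apply` are hypothetical names chosen for this illustration, not part of the described system. The sketch assumes that a revolute state is an angle in radians about the axis through the pivot point, and that a prismatic state is a signed distance along the axis.

```python
import math
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _unit(v: Vec3) -> Vec3:
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return (v[0] / n, v[1] / n, v[2] / n)


def _cross(a: Vec3, b: Vec3) -> Vec3:
    """Cross product a x b."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


@dataclass
class RevoluteJoint:
    axis: Vec3   # direction of the rotation axis (hypothetical field name)
    pivot: Vec3  # any point on the rotation axis (hypothetical field name)

    def apply(self, point: Vec3, angle_rad: float) -> Vec3:
        """Rotate `point` about the joint axis using Rodrigues' formula."""
        u = _unit(self.axis)
        v = tuple(p - q for p, q in zip(point, self.pivot))
        c, s = math.cos(angle_rad), math.sin(angle_rad)
        uxv = _cross(u, v)
        udv = sum(a * b for a, b in zip(u, v))
        return tuple(v[k] * c + uxv[k] * s + u[k] * udv * (1.0 - c) + self.pivot[k]
                     for k in range(3))


@dataclass
class PrismaticJoint:
    axis: Vec3  # direction of the translation axis (hypothetical field name)

    def apply(self, point: Vec3, distance: float) -> Vec3:
        """Translate `point` by `distance` along the joint axis."""
        u = _unit(self.axis)
        return tuple(point[k] + distance * u[k] for k in range(3))
```

Note that the pivot of a revolute joint only needs to lie somewhere on the axis; moving it along the axis does not change the result of `apply`.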

4 Method

It is challenging to define part poses and joint states for unseen object instances with potentially large geometric variations and articulation differences. To tackle this problem, we introduce Articulation-Aware Normalized Coordinate Space Hierarchy (A-NCSH), a shared object representation for different object instances in a given category. An overview is given in Figure 2. We will provide details on A-NCSH and how it allows defining part poses and joints for unseen objects in Sec. 4.1. We then present a deep neural network capable of predicting the A-NCSH representation in Sec. 4.2. Last, Sec. 4.3 describes how the A-NCSH representation is used within a voting scheme to jointly optimize part poses and joint states with explicit kinematic chain constraints.

4.1 A-NCSH Representation

Our A-NCSH representation is inspired by and closely related to Normalized Object Coordinate Space (NOCS) [wang2019normalized], which we will briefly review first. NOCS is defined as a 3D space contained within a unit cube, i.e., $[0, 1]^3$, and was introduced in [wang2019normalized] to estimate the category-level 6D pose and size of rigid objects. Specifically, known objects from a certain category are consistently aligned by their centers and orientations. At the same time, these objects are pre-scaled and pre-centered so that their tight bounding boxes all have a diagonal length of 1 and are centered in the NOCS. The object pose and size can then be defined as the rigid transformation plus scaling from the NOCS coordinates to the camera space observations. NOCS provides a common reference frame for each category with a canonical global pose and size, enabling pose estimation even for unseen object instances. However, NOCS is not well-suited for articulated objects. Instead of the global pose and size, we care more about the states of rigid parts and joints, which are all ignored in NOCS.
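As a rough illustration of the pre-scaling and pre-centering step, the following Python sketch maps a point set into a unit cube so that its tight axis-aligned bounding box has a diagonal of 1 and is centered in the cube. The function name `normalize_to_unit_cube` is hypothetical, and the sketch assumes the points are already in the canonical orientation (the orientation alignment is not shown) and that "centered" means the cube center (0.5, 0.5, 0.5).

```python
import math


def normalize_to_unit_cube(points):
    """Scale and center points so the tight box has diagonal 1, centered at 0.5.

    Returns the normalized points plus the diagonal and box center that were
    used, which together describe the size and offset of the original shape.
    """
    mins = [min(p[k] for p in points) for k in range(3)]
    maxs = [max(p[k] for p in points) for k in range(3)]
    center = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    diagonal = math.sqrt(sum((hi - lo) ** 2 for lo, hi in zip(mins, maxs)))
    if diagonal == 0.0:
        raise ValueError("degenerate point set: bounding box has zero size")
    normalized = [tuple((p[k] - center[k]) / diagonal + 0.5 for k in range(3))
                  for p in points]
    return normalized, diagonal, center
```

Because the diagonal is 1, every side of the normalized box is at most 1, so all coordinates stay inside the unit cube.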

Therefore, we need a representation which not only normalizes the pose and size of each rigid part, but also normalizes the joint parameters and states. For this purpose, we present A-NCSH, a two-level hierarchy of normalized coordinate spaces. At the leaf level, for each individual part we introduce one Normalized Part Coordinate Space (NPCS) to normalize the part pose and size. At the root level, we use a single coordinate space into which all the NPCSs are transformed so that the part coordinates represent the articulated object in a pre-defined rest state. We name the root space as Normalized Articulated Object Coordinate Space (NAOCS). NAOCS complements the set of NPCSs with normalized joint parameters as well as a rest joint state defined for each joint (Figure 2). We explain both NPCS and NAOCS in detail below.


NPCS is defined similarly to NOCS [wang2019normalized] but for single parts instead of whole objects. We use a unit cube to normalize the 6D pose and size of each rigid part. Given a set of 3D shapes from a known category, we assume they all share a similar kinematic chain with the same number of parts. We consistently segment the shapes into sets of rigid parts, and each part is represented by a separate NPCS. We use the same protocol as in [wang2019normalized] to pre-align and pre-scale the parts, which not only allows us to define the 6D pose and size of each part but also provides a natural way to obtain the amodal 3D bounding box for each part. We define the 6D pose and size of a part as the rigid transformation plus scaling from its NPCS coordinates to its camera coordinates. This definition naturally generalizes to unseen instances once the corresponding NPCS is predicted.


NPCS is defined for each component in the kinematic chain separately, and does not consider the relationship between different parts. Therefore, normalizing joint parameters is difficult. Moreover, NPCS is not able to fully normalize the states of different types of joints. For example, for an unseen cabinet instance with prismatic joints, it is hard to determine how far an individual drawer is pulled out by using part coordinates alone. Therefore, to normalize joint parameters and joint states, we introduce NAOCS, a canonical frame that brings different NPCS coordinates together. Specifically, we transform the part coordinates in each NPCS to a global NAOCS space where different parts compose the original articulated object in a pre-defined rest state (e.g., a cabinet with a closed drawer). The rest state of the object defines the rest state of each joint. In addition, the rotation axes of revolute joints or translation axes of prismatic joints from different object instances all follow the same canonical orientations and are naturally aligned in NAOCS. With these joint parameters, we can explicitly incorporate the kinematic constraints when estimating the part poses and joint states. In this work, we specify the transformation from each NPCS to NAOCS by a global translation and scaling.

Compared with NOCS, NAOCS not only normalizes the global pose and size of the articulated objects, but also normalizes their joint parameters and states. It is worth mentioning that from NAOCS alone we cannot obtain the per-part amodal 3D bounding box. Therefore, to generate a full state description for an articulated object, we need to use NAOCS together with the set of NPCSs, which together form our A-NCSH representation.

Figure 3: A-NCSH network example. The network uses three PointNet++ [qi2017pointnet++] modules to predict the A-NCSH representation that includes part segmentation, per-part NPCS coordinates, global transformation (scale and translation) from each NPCS to NAOCS, and joint parameters in NAOCS.

4.2 A-NCSH Network

We present a deep neural network capable of predicting the A-NCSH representation for unseen articulated object instances. As shown in Figure 3, the network accepts a point cloud as input, and it contains three modules adapted from PointNet++ [qi2017pointnet++] segmentation architectures. The prediction by the network consists of four types of information: rigid part segmentation, per-part NPCS coordinates, global transformation from each NPCS to NAOCS, and joint parameters in NAOCS.

The first module deals with part segmentation and NPCS coordinate regression. We assume that objects from the category of interest all share the same kinematic chain with the same number of parts, and we simply associate each point with one of the chain parts for segmentation purposes. Since we normalize each part with a separate NPCS, the network predicts all possible NPCS coordinates for each point instead of regressing a unique one. The final NPCS coordinates are selected using the predicted segmentation label. The segmentation branch and NPCS regression branch share the same feature backbone because the two tasks are closely related. They only differ in the heads, which each have one fully-connected layer.

The second module predicts global transformations which place the NPCSs properly in the NAOCS. We design NPCS and NAOCS so that the part orientation remains the same in both spaces. This allows us to estimate only a global translation and a global scaling for each NPCS. We use this approach throughout all of our experiments. The input to this module is a concatenation of the original point cloud and the final NPCS coordinates predicted by the first module. Instead of predicting a unique translation and scaling for each NPCS, we perform a dense regression of a per-point global translation and global scaling, which tends to provide more reliable NAOCS coordinates in practice. The final per-point global translation and global scaling are selected based on the predicted segmentation label. The NAOCS coordinates are then obtained by applying the selected global scaling and translation to the NPCS coordinates.
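The selection and composition steps of the first two modules can be sketched in Python as follows. The function names `select_by_label` and `npcs_to_naocs` are hypothetical, and the sketch assumes that the global scaling is a single scalar per point and that it is applied before the translation; neither detail is confirmed by the text here.

```python
def select_by_label(per_part_values, labels):
    """Pick, for each point, the prediction made under its predicted part.

    per_part_values[i][j] is the value predicted for point i assuming it
    belongs to part j; labels[i] is the predicted part index of point i.
    """
    return [values[label] for values, label in zip(per_part_values, labels)]


def npcs_to_naocs(npcs_coords, scales, translations):
    """Place NPCS coordinates in NAOCS with a per-point scale and translation.

    Assumed form: g_i = s_i * c_i + t_i, with no rotation, because the part
    orientation is designed to be the same in both spaces.
    """
    return [tuple(s * c[k] + t[k] for k in range(3))
            for c, s, t in zip(npcs_coords, scales, translations)]
```

In this sketch the same `select_by_label` helper serves for NPCS coordinates, per-point scales, and per-point translations, since all three are predicted once per candidate part.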

The last module infers joint parameters for each joint in the NAOCS space. (We use the symbol “  ” to distinguish NAOCS-space parameters from camera-space parameters.) As mentioned before, we are mainly interested in two types of joints: 1D revolute joints, whose parameters include the rotation axis direction and the pivot point position, and 1D prismatic joints, whose parameters are the translation axis direction. Joints serve as an auxiliary structure to the object geometry, and the fact that they are spatially sparse introduces challenges to the problem of parameter estimation. To cope with these challenges, we estimate joint heat maps on the point cloud data to localize joints, and we leverage voting schemes to estimate the joint parameters. To be specific, for the input point cloud, we use the corresponding NAOCS coordinates to compute a per-point heat map associated with a revolute joint as follows:

Here the heat map depends on a distance threshold, which is kept fixed in all the experiments. For a prismatic joint, given its child part (we assume a known kinematic chain for each category, so this is easy to obtain), we define the per-point heat map as:

We regress one heat map for each joint and, at the same time, perform a dense regression of the joint axis orientation (the rotation axis for a revolute joint or the translation axis for a prismatic joint). The final prediction of the joint axis orientation is simply the average prediction over all the points with a heat score larger than a fixed threshold. To predict the pivot point of a 1D revolute joint, which is not uniquely defined (it could move arbitrarily along the rotation axis), points are again filtered according to their heat scores to vote for a possible answer. We perform a dense regression of a unit direction vector from each point to the rotation axis. Since the heat map for a revolute joint already indicates the distance from each point to the rotation axis, together with the predicted direction vectors, we can project each point onto the rotation axis. We simply average these projections to get a possible pivot point.
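The voting step for a revolute joint can be sketched in Python as below. The function name `vote_axis_and_pivot` and all argument names are hypothetical. The sketch assumes the per-point distance to the axis has already been decoded from the heat score (that conversion is not reproduced here), and it renormalizes the averaged axis to unit length, which is an added convenience rather than something stated in the text.

```python
import math


def vote_axis_and_pivot(points, axis_preds, dirs_to_axis, dists_to_axis,
                        heat, threshold):
    """Average per-point votes from points whose heat score exceeds threshold.

    points        : per-point coordinates in the space where the joint lives
    axis_preds    : per-point predicted joint axis direction
    dirs_to_axis  : per-point unit vector pointing from the point to the axis
    dists_to_axis : per-point distance to the axis (assumed already decoded)
    heat          : per-point heat score
    """
    keep = [i for i, h in enumerate(heat) if h > threshold]
    if not keep:
        raise ValueError("no point passes the heat threshold")

    # Joint axis: mean of the kept predictions, renormalized (assumption).
    mean_axis = [sum(axis_preds[i][k] for i in keep) / len(keep) for k in range(3)]
    norm = math.sqrt(sum(a * a for a in mean_axis))
    axis = tuple(a / norm for a in mean_axis)

    # Pivot: move each kept point onto the axis, then average the projections.
    projections = [tuple(points[i][k] + dists_to_axis[i] * dirs_to_axis[i][k]
                         for k in range(3)) for i in keep]
    pivot = tuple(sum(q[k] for q in projections) / len(keep) for k in range(3))
    return axis, pivot
```

Since the pivot is only defined up to a shift along the axis, the averaged projection is one valid choice among many.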

Loss functions:

We use relaxed IoU loss [yi2018deep] for part segmentation. We use mean-square loss for NPCS coordinate regression. Instead of directly supervising the per-point global translation and scaling for each part, we compose NPCS coordinates from all different parts, leveraging the predicted global transformations, and minimize the mean-square difference from the ground truth NAOCS coordinates. This trick stabilizes the prediction of global transformations and improves the predicted NAOCS coordinates. We also use mean-square loss for all the joint-related predictions. Our total loss is a weighted sum of these terms, and we set the multiplication factors to 1, 10, 10, 1 heuristically in our experiments.

Training data generation:

To train this network, we generate synthetic depth renderings using the 3D object models provided in the Shape2Motion dataset [wang2019shape2motion]. In this dataset, the data for each object instance contains descriptions of the object’s 3D geometry and its articulation parameters, which are both necessary for training our network. During rendering, the program automatically generates random articulation poses for each object instance, according to its joint limits. Then the depth images and corresponding ground truth masks are rendered from a set of random camera viewpoints. We also filter out camera poses where some parts of the object are completely occluded. On average, 30,000 training images for 40 different object instances are generated for each object category.

4.3 Pose Optimization with Kinematic Constraints

In the previous section, we introduced our A-NCSH network, which not only segments a 3D point cloud into rigid parts, but also predicts the NPCS coordinates, the per-point global translation and global scaling, the NAOCS coordinates, and the joint parameters in the NAOCS. To fully describe the pose of an articulated object, we still need to estimate the 6D poses and sizes for all the parts, as well as the joint states and joint parameters in the camera space.

Since for each part we have its NPCS coordinates and camera space point coordinates in correspondence, we could follow [wang2019normalized] to estimate its 6D pose and size. In [wang2019normalized], the Umeyama algorithm [umeyama1991least] is adopted within a RANSAC [fischler1981random] framework to robustly estimate the 6D pose and size of a single rigid object. However, a naive extension of the approach to each individual part in our setting would easily lead to physically impossible poses. Since the degrees of freedom of the parts in a kinematic chain are not independent, a part-based pose estimation approach with noisy NPCS coordinates can easily break the kinematic constraints. To cope with this issue, we need to introduce kinematic constraints while estimating the part poses. Without the kinematic constraints, the energy function over all part poses is simply the sum of the per-part fitting errors.

We then introduce the kinematic constraints by adding an energy term for each joint to the energy function. In concrete terms, our modified energy function is , where is defined differently for each type of joint. For a revolute joint with parameters in the NAOCS, assuming it connects part and part , we define as:

For a prismatic joint with parameters in the NAOCS, again assuming it connects part and part , we define as:

where converts a vector into the matrix for conducting cross product with other vectors, and is defined as:

To minimize the energy function with kinematic constraints, we can no longer solve for the different part poses separately using the Umeyama algorithm. Instead, we first minimize the unconstrained energy using the Umeyama algorithm to initialize our estimate of the part poses. We then adopt a non-linear least-squares solver to further optimize the constrained energy, as is commonly done for bundle adjustment [agarwal2010bundle]. Similar to [wang2019normalized], we also use RANSAC for outlier removal.
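To make the quantities in this optimization more tangible, here is a small Python sketch of two building blocks: a per-part fitting error and the cross-product matrix of a vector. The function names `part_energy` and `skew` are hypothetical. The residual follows the pose definition used in this paper (scaling plus rigid transformation from NPCS coordinates to camera coordinates), but averaging the squared residuals over the points of a part is an assumption made for this illustration, and the joint terms are not reproduced.

```python
def mat_vec(matrix, vector):
    """Multiply a 3x3 matrix (tuple of rows) by a 3-vector."""
    return tuple(sum(matrix[r][k] * vector[k] for k in range(3)) for r in range(3))


def part_energy(camera_points, npcs_points, scale, rotation, translation):
    """Mean squared residual between p_i and scale * R * c_i + t for one part.

    The mean (rather than a plain sum) is an assumption of this sketch.
    """
    total = 0.0
    for p, c in zip(camera_points, npcs_points):
        rc = mat_vec(rotation, c)
        total += sum((p[k] - (scale * rc[k] + translation[k])) ** 2
                     for k in range(3))
    return total / len(camera_points)


def skew(v):
    """Return the 3x3 matrix [v]_x such that [v]_x * w equals the cross product v x w."""
    x, y, z = v
    return ((0.0, -z, y),
            (z, 0.0, -x),
            (-y, x, 0.0))
```

A perfect pose gives zero energy for a part, so the value returned by `part_energy` can serve as a quick sanity check on an initialization before any refinement with joint terms.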

After estimating the pose and size of each part, the joint states and parameters can also be deduced. For a revolute joint connecting parts and , we compute its parameters in the camera space as:

where and are computed through averaging the per-point global scaling and translation over all the points on part . The joint state can be computed as:

For a prismatic joint connecting parts and , we compute its parameters in the camera space as:

and its state is simply .
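As a rough illustration of how a revolute joint state could be read off the estimated part poses, the Python sketch below computes the rotation angle between the rotations of the two connected parts. The function name `relative_rotation_angle_deg` is hypothetical. The sketch assumes that in the rest state both parts have the same orientation, so that the relative rotation at rest is the identity; it returns an unsigned angle and is not necessarily the formula used in this paper.

```python
import math


def relative_rotation_angle_deg(rot_a, rot_b):
    """Angle in degrees of the relative rotation rot_a * rot_b^T.

    Both inputs are 3x3 rotation matrices given as tuples of rows. The trace
    of A * B^T equals the sum of element-wise products A_ij * B_ij, and the
    rotation angle follows from trace = 1 + 2 * cos(angle).
    """
    trace = sum(rot_a[r][k] * rot_b[r][k] for r in range(3) for k in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from round-off.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))
```

Comparing the result against a ground truth joint angle gives a joint state error in degrees.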

5 Evaluation

| Category | Method | Part rotation error (°) | Part translation error | Part 3D IoU (%) | Joint state error | Joint axis angle error (°) | Joint axis distance error |
|---|---|---|---|---|---|---|---|
| Eyeglasses | NPCS | 4.0, 7.7, 7.2 | 0.044, 0.080, 0.071 | 86.9, 40.5, 41.4 | 8.6, 8.4 | - | - |
| Eyeglasses | NAOCS | 5.6, 24.6, 21.0 | 0.163, 0.452, 0.227 | - | 25.8, 22.8 | - | - |
| Eyeglasses | A-NCSH | 3.7, 5.1, 3.7 | 0.035, 0.051, 0.057 | 87.4, 43.6, 44.5 | 4.1, 4.5 | 0.39, 0.17 | 0.044, 0.017 |
| Oven | NPCS | 1.2, 2.9 | 0.030, 0.043 | 75.8, 89.0 | 3.0 | - | - |
| Oven | NAOCS | 1.7, 4.7 | 0.036, 0.090 | - | 5.1 | - | - |
| Oven | A-NCSH | 1.1, 2.2 | 0.033, 0.043 | 75.9, 89.5 | 2.1 | 0.21 | 0.016 |
| Washing Machine | NPCS | 1.0, 1.9 | 0.042, 0.055 | 86.9, 88.1 | 2.2 | - | - |
| Washing Machine | NAOCS | 1.1, 3.3 | 0.072, 0.119 | - | 3.1 | - | - |
| Washing Machine | A-NCSH | 1.0, 1.4 | 0.045, 0.037 | 91.5, 92.0 | 1.5 | 0.12 | 0.006 |
| Laptop | NPCS | 11.6, 4.4 | 0.098, 0.044 | 35.7, 93.6 | 14.4 | - | - |
| Laptop | NAOCS | 12.4, 4.9 | 0.110, 0.049 | - | 15.2 | - | - |
| Laptop | A-NCSH | 6.7, 4.3 | 0.062, 0.044 | 41.1, 93.0 | 9.7 | 0.17 | 0.011 |
| Drawer | NPCS | 1.9, 3.5, 2.4, 1.8 | 0.032, 0.038, 0.024, 0.025 | 82.8, 71.2, 71.5, 79.3 | 0.026, 0.031, 0.046 | - | - |
| Drawer | NAOCS | 1.5, 2.5, 2.5, 2.0 | 0.044, 0.045, 0.073, 0.054 | - | 0.043, 0.066, 0.048 | - | - |
| Drawer | A-NCSH | 1.0, 1.1, 1.2, 1.5 | 0.024, 0.021, 0.021, 0.033 | 84.0, 72.1, 71.7, 78.6 | 0.011, 0.020, 0.030 | 0.13, 0.13, 0.13 | - |

Table 1: Performance comparison on unseen object instances. The categories eyeglasses, oven, washing machine, and laptop contain only revolute joints and the drawer category contains three prismatic joints.

5.1 Experimental Setup


We use the following metrics to evaluate and compare our algorithm’s performance.

  • Per-part metrics. For each part, we evaluate rotation error measured in degrees, translation error in normalized part coordinate space, and 3D intersection over union (IoU) [song2016deep] of the predicted amodal bounding box. For each case, we normalize the translation so that the translation errors are comparable among parts with different sizes.

  • Joint state. For each revolute joint, we find the joint angle errors in degrees. For each prismatic joint, we compute the error of relative translation amounts. The relative translation is defined in the NAOCS.

  • Joint parameter. For each revolute joint, we evaluate the orientation error of the rotation axis in degrees, and the location error using the minimum line-to-line distance in NAOCS. For each prismatic joint, we compute the orientation error of the translation axis.
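The two joint parameter metrics listed above can be sketched in Python as follows. The function names `axis_angle_error_deg` and `line_to_line_distance` are hypothetical. The sketch measures the plain angle between the predicted and ground truth directions (so opposite directions count as 180 degrees, which is an assumption), and it uses the standard closed form for the minimum distance between two 3D lines, with a separate branch for parallel lines.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _norm(a):
    return math.sqrt(_dot(a, a))


def axis_angle_error_deg(pred_axis, gt_axis):
    """Angle in degrees between two axis directions (not sign-invariant)."""
    cos_angle = _dot(pred_axis, gt_axis) / (_norm(pred_axis) * _norm(gt_axis))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def line_to_line_distance(p1, d1, p2, d2, eps=1e-9):
    """Minimum distance between line (p1, d1) and line (p2, d2)."""
    w = _sub(p2, p1)
    n = _cross(d1, d2)
    n_norm = _norm(n)
    if n_norm < eps * _norm(d1) * _norm(d2):
        # Parallel lines: distance from p2 to the first line.
        return _norm(_cross(w, d1)) / _norm(d1)
    # Skew or intersecting lines: project w onto the common normal.
    return abs(_dot(w, n)) / n_norm
```

Because a pivot point may lie anywhere on its axis, comparing two axes as lines avoids penalizing a prediction whose pivot is merely shifted along the correct axis.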


We have evaluated our algorithm using both synthetic and real-world datasets. To generate the synthetic test set, we used a different set of object instances from [wang2019shape2motion] that do not overlap with our training data. Following the same rendering pipeline with random camera viewpoints, we generated on average 3000 testing images of unseen object instances for each object category. For the real data, we evaluated our algorithm on the dataset provided by Michel et al. [michel2015pose], which contains depth images for 4 different objects captured using the Kinect.


There are no existing methods for category-level articulated object pose estimation. We therefore used ablated versions of our system for baseline comparison.

  • NPCS. This algorithm predicts part segmentation and NPCS for each part (without the joint parameters). The prediction allows the algorithm to infer part pose, the amodal bounding box for each part, and the joint state for revolute joints by treating each part as an independent rigid body. However, it is not able to perform combined optimization with the kinematic constraints.

  • NAOCS. This algorithm predicts part segmentation for each part and the NAOCS representation for the whole object instance. The prediction allows the algorithm to infer part pose and joint state, but not the amodal bounding box for each part, since amodal bounding boxes are not defined in the NAOCS alone.

  • Direct joint voting. This algorithm directly votes for joint-associated parameters in camera space, including offset vectors and orientation for each joint, from the point cloud using a PointNet++ [qi2017pointnet++] segmentation architecture.

Our final algorithm predicts the full A-NCSH representation, which includes NPCS, joint parameters, and per-point global scaling and translation values that can be used together with the NPCS prediction for computing NAOCS.

Figure 4: Qualitative Results. Top two rows show test results on unseen object instances from the Shape2Motion dataset [wang2019shape2motion]. Bottom two rows show test results on seen instances in the real-world dataset [michel2015pose]. Here we visualize the predicted amodal bounding box for each part. Color images are for visualization only.

5.2 Experimental Results

Figure 4 presents some qualitative results. Tables 1 to 3 summarize the quantitative results. The following paragraphs provide our analysis and discussion of the results.

Effect of global optimization.

First, we examine how global optimization influences the accuracy of articulated object pose estimation, using both predicted joint parameters and predicted part poses. To this end, we compare the performance of NPCS and A-NCSH, where NPCS performs per-part pose estimation and A-NCSH performs a combined optimization using the full kinematic chain to constrain the result. The results in Table 1 show that the combined optimization of joint parameters and part pose consistently improves the algorithm’s accuracy for almost all object categories and evaluation metrics. The improvement is particularly salient for thin object parts such as the two temples of eyeglasses (the parts that extend over the ears), where the per-part method produces large pose errors due to limited point observations and shape ambiguity. This result demonstrates that the joint parameters predicted in the NPCS can regularize the part poses based on kinematic chain constraints during the combined pose optimization step and improve the pose estimation accuracy.

Comparison on joint parameters estimation.

Directly voting for the location and orientation of joints in camera space with all degrees of freedom is challenging. In our approach, we choose to predict the joint parameters in NPCS, since it provides a canonical representation in which the joint axes usually have a strong orientation prior, which simplifies prediction considerably. We further use a voting-based scheme to reduce prediction noise. Based on the high-quality prediction of part poses, we can then transform the joint parameter predictions from NPCS into the 3D reference frame of the camera. Compared to the direct voting baseline using PointNet++, Table 2 shows that our approach significantly improves joint axis prediction for unseen instances.
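The two steps described above, aggregating per-point votes and then mapping the joint from the canonical space into the camera frame, can be sketched as follows. This is a minimal illustration under stated assumptions: votes are aggregated by simple averaging (the voting scheme in the paper may differ), the part pose is a similarity transform p_cam = s R p + t, and all function names and example values are hypothetical.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def mean_vector(vectors):
    """Component-wise mean of a list of 3-vectors."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(3)]


def apply_rotation(r, v):
    """Apply a 3x3 rotation matrix (nested lists) to a 3-vector."""
    return [sum(r[i][k] * v[k] for k in range(3)) for i in range(3)]


def joint_to_camera(axis_votes, pivot_votes, scale, rotation, translation):
    """Aggregate canonical-space joint votes and map them to camera space.

    Assumption: axis votes point into the same hemisphere, so averaging is
    meaningful. Directions are only rotated; points are scaled, rotated and
    translated.
    """
    axis = normalize(mean_vector(axis_votes))
    pivot = mean_vector(pivot_votes)
    axis_cam = apply_rotation(rotation, axis)
    rotated_pivot = apply_rotation(rotation, pivot)
    pivot_cam = [scale * rotated_pivot[i] + translation[i] for i in range(3)]
    return axis_cam, pivot_cam


# Hypothetical example: noisy votes around a z-aligned axis in canonical space,
# and a part pose consisting of a 90 degree rotation about x, scale 2, shift in z.
axis_votes = [[0.1, 0.0, 1.0], [-0.1, 0.0, 1.0]]
pivot_votes = [[0.5, 0.4, 0.5], [0.5, 0.6, 0.5]]
rotation = [[1.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
axis_cam, pivot_cam = joint_to_camera(axis_votes, pivot_votes, 2.0, rotation, [0.0, 0.0, 3.0])
```

Averaging in the canonical space benefits from the orientation prior mentioned above: votes cluster around a few preferred directions, so the mean is less affected by per-point noise than it would be in camera space.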

Category         Method      Angle error       Distance error
Eyeglasses       PointNet++  2.9, 15.7         0.129, 0.183
Eyeglasses       A-NCSH      0.39, 0.17        0.044, 0.017
Oven             PointNet++  27.0              0.017
Oven             A-NCSH      0.21              0.016
Washing Machine  PointNet++  8.67              0.012
Washing Machine  A-NCSH      0.12              0.006
Laptop           PointNet++  29.5              0.011
Laptop           A-NCSH      0.17              0.011
Drawer           PointNet++  4.9, 5.0, 5.1     -
Drawer           A-NCSH      0.13, 0.13, 0.13  -

Table 2: A comparison of joint parameter predictions.

Handling cases under severe occlusion.

To further investigate our algorithm’s robustness under occlusion, we manually controlled the occlusion levels in the test images and evaluated the algorithm’s performance. The occlusion level is defined as the ratio of the number of visible points of a specific object part to the full part size. We divided the occlusion level into three categories according to the fraction of the part that is visible. Table 3 shows the test results for the temple parts of unseen eyeglass instances. As the occlusion level increases, the performance of the NPCS algorithm drops markedly (from 42.7% to 38.5% in 3D IoU). By comparison, our algorithm still performs reasonably well, with a slight drop of 1.3%.

Method  Occlusion level (low to high)
NPCS    42.7   41.0   38.5
A-NCSH  44.8   44.3   43.5

Table 3: Performance (IoU%) under occlusion.
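To make the occlusion measure and the reported drops concrete, the short sketch below computes the visible-point ratio of a part and the difference between the lowest and highest occlusion columns of Table 3. The function name `visible_ratio` and the example point counts are hypothetical; the IoU values are copied from Table 3.

```python
def visible_ratio(num_visible_points, num_full_part_points):
    """Fraction of a part's points that are visible in the depth image."""
    if num_full_part_points <= 0:
        raise ValueError("the full part must contain at least one point")
    return num_visible_points / num_full_part_points


# Hypothetical example: 150 of the 600 points of a temple part are visible.
example_ratio = visible_ratio(150, 600)

# 3D IoU (%) from Table 3, ordered from low to high occlusion.
table3_iou = {
    "NPCS": [42.7, 41.0, 38.5],
    "A-NCSH": [44.8, 44.3, 43.5],
}

# Drop in IoU between the lowest and the highest occlusion level.
iou_drop = {method: round(values[0] - values[-1], 1)
            for method, values in table3_iou.items()}
```

With these numbers the per-part NPCS baseline loses 4.2 points of IoU across the three occlusion levels, while A-NCSH loses 1.3.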

Object     Sequence  Scope  Brachmann et al. [brachmann2014learning]  Michel et al. [michel2015pose]  A-NCSH (Ours)
Laptop     1         all    8.9%                                      64.8%                           94.1%
Laptop     1         parts  29.8%, 25.1%                              65.5%, 66.9%                    97.5%, 94.7%
Laptop     2         all    1%                                        65.7%                           98.4%
Laptop     2         parts  1.1%, 63.9%                               66.3%, 66.6%                    98.9%, 99.0%
Cabinet    3         all    0.5%                                      95.8%                           90.0%
Cabinet    3         parts  86%, 46.7%, 2.6%                          98.2%, 97.2%, 96.1%             98.9%, 97.8%, 91.9%
Cabinet    4         all    49.8%                                     98.3%                           94.5%
Cabinet    4         parts  76.8%, 85%, 74%                           98.3%, 98.7%, 98.7%             99.5%, 99.5%, 94.9%
Cupboard   5         all    90%                                       95.8%                           93.9%
Cupboard   5         parts  91.5%, 94.3%                              95.9%, 95.8%                    99.9%, 93.9%
Cupboard   6         all    71.1%                                     99.2%                           99.9%
Cupboard   6         parts  76.1%, 81.4%                              99.9%, 99.2%                    100%, 99.9%
Toy train  7         all    7.8%                                      98.1%                           68.4%
Toy train  7         parts  90.1%, 17.8%, 81.1%, 52.5%                99.2%, 99.9%, 99.9%, 99.1%      92.0%, 68.5%, 99.3%, 99.2%
Toy train  8         all    5.7%                                      94.3%                           91.1%
Toy train  8         parts  74.8%, 20.3%, 78.2%, 51.2%                100%, 100%, 97%, 94.3%          100%, 100%, 100%, 91.1%

Table 4: Instance-level real-world depth benchmark. Although not designed for instance-level articulated object pose estimation, our algorithm achieves performance comparable to the state-of-the-art approach and improves the performance for challenging cases such as laptops.

Generalization to real depth images.

We have also tested our algorithm’s ability to generalize to real-world depth images on the dataset provided in [michel2015pose]. The dataset is designed for instance-level articulated object pose estimation, and it contains four different object instances: laptop, cupboard, 4-part toy train, and cabinet. Each has two test sequences, captured using a Kinect. Following the same training protocol, we train the algorithm with synthetically rendered depth images of the object instances using the URDF files provided in [michel2015pose]. We then test the pose estimation accuracy on the real-world depth images. Although our algorithm is not specifically designed for instance-level pose estimation and the network has never been trained on any real-world depth images, it achieves strong performance, on par with or even better than the state of the art. Table 4 reports the quantitative results and Figure 8 shows example pose estimation results.

6 Conclusion

This paper has presented an approach for category-level pose estimation of articulated objects from a single depth image. To accommodate unseen object instances and different articulation types, we proposed a new representation scheme known as Articulation-Aware Normalized Coordinate Space Hierarchy (A-NCSH). The system formulates articulated pose fitting from the A-NCSH space as a combined optimization problem, taking both rigid part pose fitting and joint constraints into consideration. Our experiments demonstrate that the A-NCSH representation and the global optimization approach significantly improve algorithm accuracy in both part pose prediction and joint parameter estimation.

7 Acknowledgement

This research was supported by a grant from the Toyota-Stanford Center for AI Research. This research used resources provided by Advanced Research Computing within the Division of Information Technology at Virginia Tech. We thank the Vision and Learning Lab at Virginia Tech for help with visualization tools. We are also grateful for financial and hardware support from Google.


Appendix A Real-world instance-level benchmark.

Table 4 shows a quantitative comparison of AD accuracy on the ICCV2015 Articulated Object Challenge [michel2015pose], which contains RGB-D data of 4 articulated objects: laptop, cabinet, cupboard and toy train. Qualitative results are visualized in Figure 8. This dataset provides 2 test sequences for each object. Each sequence contains around 1000 images captured by slowly moving an RGB-D camera around the object. Objects maintain the same articulation state within each sequence. Each part of the articulated object is annotated with its 6D pose with respect to the known CAD model. Since no training data is provided, we use the provided CAD models to render synthetic depth data, considering 10 groups of random articulation states. We render object masks for the test sequences with PyBullet [coumans2018]. Following the original paper [michel2015pose], we adopt a fixed fraction of the object part diameter as the threshold to compute Averaged Distance (AD) accuracy, and test the performance on each sequence separately.
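The AD accuracy described above can be sketched as follows, using the commonly used definition of averaged distance: the mean distance between model points transformed by the predicted pose and by the ground-truth pose. This is an illustrative sketch, not the benchmark's evaluation code. The function names are hypothetical, and `threshold_fraction` is a caller-supplied parameter; the value used in the example is a placeholder rather than the value used in the paper.

```python
import math


def transform(points, rotation, translation):
    """Apply a rigid pose (3x3 rotation as nested lists, 3-vector translation)."""
    return [[sum(rotation[i][k] * p[k] for k in range(3)) + translation[i]
             for i in range(3)] for p in points]


def diameter(points):
    """Largest pairwise distance between model points (brute force)."""
    return max(math.dist(p, q) for p in points for q in points)


def average_distance(points, pose_pred, pose_gt):
    """Mean distance between corresponding points under the two poses."""
    pred = transform(points, *pose_pred)
    gt = transform(points, *pose_gt)
    return sum(math.dist(a, b) for a, b in zip(pred, gt)) / len(points)


def ad_accuracy(points, pred_poses, gt_poses, threshold_fraction):
    """Fraction of frames whose averaged distance is within the threshold."""
    limit = threshold_fraction * diameter(points)
    hits = sum(average_distance(points, p, g) <= limit
               for p, g in zip(pred_poses, gt_poses))
    return hits / len(pred_poses)


# Hypothetical example: unit cube corners as the part model, two frames.
cube = [[x, y, z] for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gt_poses = [(identity, [0.0, 0.0, 0.0]), (identity, [0.0, 0.0, 0.0])]
pred_poses = [(identity, [0.05, 0.0, 0.0]), (identity, [0.5, 0.0, 0.0])]
accuracy = ad_accuracy(cube, pred_poses, gt_poses, threshold_fraction=0.1)  # placeholder threshold
```

In this example the first frame is off by 0.05 and the second by 0.5, against a limit of 0.1 times the cube diameter, so one of the two frames counts as correct.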

Appendix B Additional results

Figure 7 shows additional qualitative results on the Shape2Motion [wang2019shape2motion] dataset.

Appendix C Comparison on joints estimation.

Table 2 compares the joint parameter estimation results for all object categories. The results show that our approach consistently outperforms the PointNet++ regression baseline on both joint orientation and location accuracy. Note that the prismatic joints in drawers do not have a pivot point, so there is no need to evaluate the distance error.
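For readers who want to reproduce this kind of comparison, the sketch below shows one way to compute the two quantities: the angle between the predicted and ground-truth joint axes, and the minimum distance between the two axis lines. It assumes that the distance error is a line-to-line distance and that the axis direction sign matters; neither assumption is confirmed by the text above, and all names and example values are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def norm(a):
    return math.sqrt(dot(a, a))


def axis_angle_error_deg(u_pred, u_gt):
    """Angle in degrees between two joint axis directions."""
    cos_angle = dot(u_pred, u_gt) / (norm(u_pred) * norm(u_gt))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def line_distance(p_pred, u_pred, p_gt, u_gt, eps=1e-9):
    """Minimum distance between two 3D lines, each given by a point and a direction."""
    normal = cross(u_pred, u_gt)
    diff = [g - p for p, g in zip(p_pred, p_gt)]
    if norm(normal) < eps:
        # Parallel lines: distance from one pivot to the other line.
        return norm(cross(diff, u_gt)) / norm(u_gt)
    return abs(dot(diff, normal)) / norm(normal)


# Hypothetical example values.
angle_err = axis_angle_error_deg([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])
skew_dist = line_distance([0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                          [0.0, 0.0, 0.2], [0.0, 1.0, 0.0])
parallel_dist = line_distance([0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                              [0.0, 3.0, 4.0], [1.0, 0.0, 0.0])
```

A prismatic joint has only a direction, so only `axis_angle_error_deg` applies to it, which matches the missing distance entries for the drawer category.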

Category         Method      Angle error       Distance error
Eyeglasses       PointNet++  2.9, 15.7         0.129, 0.183
Eyeglasses       A-NCSH      0.39, 0.17        0.044, 0.017
Oven             PointNet++  27.0              0.017
Oven             A-NCSH      0.21              0.016
Washing Machine  PointNet++  8.67              0.012
Washing Machine  A-NCSH      0.12              0.006
Laptop           PointNet++  29.5              0.011
Laptop           A-NCSH      0.17              0.011
Drawer           PointNet++  4.9, 5.0, 5.1     -
Drawer           A-NCSH      0.13, 0.13, 0.13  -

Table 5: A comparison of joint parameter predictions.

Appendix D Part definition

Figure 5 shows the index definitions of parts for each object category used in the main paper.

Category         Part 0  Part 1       Part 2        Part 3
Eyeglasses       center  left temple  right temple  -
Oven             base    door         -             -
Washing Machine  base    door         -             -
Laptop           base    display      -             -
Drawer           base    lowest       middle        top

Figure 5: Part definitions for each object category.

Appendix E Limitation and failure cases

Figure 6 shows typical failure cases of our algorithm. A typical failure mode is inaccurate prediction under heavy occlusion, where one of the object parts is barely observed. Figure 6 shows one such case, in which one of the eyeglasses temples is almost completely occluded. In addition, when a prismatic joint is heavily occluded, there is considerable ambiguity in the A-NCSH prediction of the size of the heavily occluded parts, as shown in Figure 6. However, the NAOCS representation does not suffer from this size ambiguity, which leads to a more reliable estimation of the joint state (relative translation distance compared to the rest state) and joint parameters (translation axis).

Figure 6: Failure cases. Left column shows failure cases on unseen eyeglasses instances, when a part is under heavy occlusion and barely visible. Right column shows the failure case on unseen drawers, when there are shape variations on parts and only the front area of the drawer is visible. The predicted drawer size is bigger than the real size. Although the box prediction is wrong, our method can reliably predict the joint state and joint parameters by leveraging the NAOCS representation.
Figure 7: Additional results on category-level Shape2Motion dataset. The first column shows the input point clouds; the second column shows our predicted and the ground truth part segmentation masks; the third and fourth columns show our predicted and the ground truth NPCS and NAOCS, where the RGB channels encode the coordinates; the fifth column visualizes joint voting, where the arrows represent offset vectors to the rotational hinge for revolute joints and the direction of the joint axis for prismatic joints; the sixth column visualizes per-part 3D bounding boxes, together with joint parameters.
Figure 8: Additional results on real-world instance-level depth dataset. More qualitative results on all 4 objects from the ICCV2015 Articulated Object Challenge [michel2015pose] are shown here, with toy train, cupboard, laptop, and cabinet ordered from the top row to the bottom row. Only depth images are used for pose estimation; RGB images are shown here for reference. For each object, we estimate tight 3D bounding boxes for all parts on the kinematic chain and project the predicted bounding boxes back onto the depth image.