CloudPose
Code for "6D Object Pose Regression via Supervised Learning on Point Clouds" @ICRA2020
This paper addresses the task of estimating the 6 degrees of freedom pose of a known 3D object from depth information represented by a point cloud. Deep features learned by convolutional neural networks from color information have been the dominant features used for inferring object poses, while depth information receives much less attention. However, depth information contains rich geometric information about the object shape, which is important for inferring the object pose. We use depth information represented by point clouds as the input to both deep networks and geometry-based pose refinement, and use separate networks for rotation and translation regression. We argue that the axis-angle representation is a suitable rotation representation for deep learning, and use a geodesic loss function for rotation regression. Ablation studies show that these design choices outperform alternatives such as the quaternion representation and L2 loss, or regressing translation and rotation with the same network. Our simple yet effective approach clearly outperforms state-of-the-art methods on the YCB-Video dataset. The implementation and trained model are available at: https://github.com/GeeeG/CloudPose.
The problem of 6 degrees of freedom (6D) object pose estimation is to determine the transformation from a local object coordinate system to a reference coordinate system (e.g., a camera or robot coordinate system) [19]. The transformation is composed of a 3D location and a 3D orientation. Robust and accurate 6D object pose estimation is of primary importance for many robotic applications such as grasping and dexterous manipulation. The recent success of convolutional neural networks (CNNs) in visual recognition has inspired methods that use deep networks to learn features from RGB images [15, 23, 30]; these learned features are used for inferring 6D object poses. Similarly, CNNs can also be applied to RGB-D images, treating depth information as an additional channel for feature learning [16, 33, 20, 3]. However, in some scenarios color information may not be available, and depth information is not always in the 2-dimensional matrix format (e.g., laser range finder data) that can be easily processed with CNN-based systems. Depth information can also be represented by a point cloud, which is an unordered set of points in a metric 3D space. In existing methods, point clouds are mainly used for pose refinement [26, 27, 32, 14] or template matching with handcrafted features extracted from point clouds
[6, 12]. Using point clouds only in the registration stage confines their usage scope, and handcrafted features are usually less effective than deep-learned features. In this work, we investigate how to accurately estimate the 6D pose of a known object represented by a point cloud segment containing only geometric information, using deep networks. Our approach is inspired by PointNet [25], a deep network for object classification and segmentation that operates on point clouds. PointNet provides a method for applying deep learning to unordered point sets, making it a suitable architecture for our purpose, and we adapt it to the problem of pose estimation. Our 6D pose regression method can be applied to any type of range data, e.g., data from laser range finders.
In developing our system, we investigate three open questions. The first is how to efficiently use depth information in a deep-learning-based system. Although many applications have shown that CNNs can extract powerful features from RGB-D information for specific tasks, due to the inherent difference between color and depth information it is unclear whether this is an efficient way to treat depth. We argue that a point cloud is a more suitable structure and should be used in the scope of both deep networks and geometry-based optimization.
The second question is whether translation and orientation should be estimated with separate networks or a single network in a supervised learning system. During a supervised learning process, a network learns the mapping from its input to the desired output guided by a loss function. Since the metric units for translation and orientation are different (i.e., meters and radians), we argue that regressing them using separate networks and loss functions is a more suitable choice. Our experiments show that an architecture with separate networks outperforms those with shared layers.
Another question is the choice of rotation representation and of the loss function for measuring the distance between two rotations. Quaternions have been a popular choice for many learning-based systems [32, 31]. However, quaternions have a unit-norm constraint, which imposes a limit on the network output range. We argue that axis-angle is a more suitable choice because it is a constraint-free representation. Concerning loss functions, the L2 loss is a popular choice for measuring the distance between two rotations [31, 17]. We argue that the geodesic distance is a more suitable choice since it is well justified mathematically and provides a clearer learning goal than the L2 loss. We show experimentally that our arguments are valid.
Our contributions are thus as follows:
We present a simple yet effective system that infers the 6D pose of known objects represented by point cloud segments containing only geometric information. The system is based on PointNet and exploits the point cloud structure by using it as the input to both deep networks and an iterative closest point (ICP) refinement process. To the best of our knowledge, this is the first deep-learning-based system that regresses 6D poses from only unordered point sets.
We demonstrate that the proposed method outperforms state-of-the-art methods on a public dataset. Experimental results show that our system outperforms methods that use both color and depth information during the pose inference stage, indicating that the proposed system is an efficient way to use depth information for pose estimation.
Ablation studies provide an evaluation of each system component. We show experimentally that each design choice has an impact on system performance.
Pose estimation has been well studied in the past using both color [15, 23, 30] and depth data [10, 28]. Since in this work we focus on investigating how to use geometric information during the pose inference stage, we mainly review works that use depth information.
With the common usage of depth cameras and laser range finders (e.g., LiDAR), methods using depth information have been proposed [10, 29, 2, 19, 22]. LINEMOD [10, 11] is one of the first works to use handcrafted features from depth information for pose estimation; it uses surface normals as part of its local patch features. This patch representation is adapted and used with random forests in [29, 28]. Surface normals are also used as an additional modality in [33]. Another way of using depth information is to treat it as an extra image channel (RGB-D) and feed it into a CNN [16, 1, 33, 20, 3], a random forest [2, 19, 22], or a fully connected sparse autoencoder [5] for feature extraction. Depth is also used to create point clouds, which are used for generating pose hypotheses with 3D-3D correspondences and ICP refinement in [14]. Point clouds are used to facilitate point-to-point matching in [6, 12]. Geometry embeddings are extracted from point cloud segments with a deep network in [31]. Approaches such as [32] use color information to provide an initial pose estimate, then refine it with ICP using depth information. Point cloud segments are also used for pose estimation in [24]; however, that work reports experiments only for estimating a single rotation angle, as opposed to all three as we do, and formulates rotation estimation as classification by discretizing the rotation angles into bins. It is not clear whether that approach would scale to three rotation angles. In our previous work [8], we only considered rotation regression for single objects; here, we regress the full 6D pose in a multi-class setting.

For learning-based systems, [32] propose to predict translation and rotation with separate networks sequentially, while [31] propose to regress translation and rotation with the same network. Quaternions are used as the rotation representation for regression in [32, 31, 3]. Bui et al. [3] propose an L2 loss function for rotation learning. The axis-angle representation is also used in [21]; however, that work only addresses estimating 3D rotation from RGB information, while we address both 3D rotation and 3D translation from point clouds. Works that use deep networks to extract features and obtain object poses via nearest-neighbor pose retrieval or classification into discretized bins [16, 33, 20, 1, 24] are outside the scope of this work and are not discussed in detail here.
Our work is most similar to DenseFusion [31] as we both use PointNet [25] to extract features from point cloud segments. However, there are two significant differences: first, during the pose estimation stage, we only use coordinate information from point clouds while DenseFusion also uses color information extracted by a CNN; second, we design regression targets and loss functions for rotation and translation individually while DenseFusion uses one regression target for the 6D pose.
Figure 1 shows an overview of our system. The proposed system for object 6D pose estimation is a multi-class system, i.e., we use the same system to predict poses for objects from different classes. Hence, an object segment, as well as the corresponding class information, is required as input to the pose estimation networks. As semantic segmentation is a well-studied problem, we assume the object segment and class information are provided by an off-the-shelf method (here, we use the semantic segmentation from [32]) and focus on object pose estimation from a point cloud segment. A point cloud segment is created using the depth image and the target object segment. This segment is processed with farthest point sampling (FPS) [7] to obtain a downsampled segment with a consistent surface structure representation. The downsampled segment and class information are combined as the input to two separate networks: one estimates rotation in the axis-angle representation, and the other predicts translation through translation residual regression. The 6D pose is then refined with a geometry-based optimization process to produce the final pose estimate.
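The FPS downsampling step above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the paper's code; the function name and the deterministic choice of the first point are our own assumptions:

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Greedy farthest point sampling: repeatedly pick the point farthest
    from the set selected so far, giving good coverage of the surface."""
    chosen = np.zeros(n_samples, dtype=int)   # start from point 0 for determinism
    # squared distance from every point to the nearest chosen point so far
    dist = np.sum((points - points[chosen[0]]) ** 2, axis=1)
    for i in range(1, n_samples):
        chosen[i] = int(np.argmax(dist))      # farthest point from the current set
        dist = np.minimum(dist, np.sum((points - points[chosen[i]]) ** 2, axis=1))
    return points[chosen]
```

On a toy 2D set, the sampler spreads picks to the extremes of the shape rather than clustering them, which is the property that makes FPS attractive for producing a consistent surface representation.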
Figure (a) illustrates BaseNet, the basic building block of our system. BaseNet is an adapted version of PointNet [25]. Given a point cloud with $n$ points as input, PointNet is invariant to all $n!$ possible permutations. Each point is processed independently using multi-layer perceptrons (MLPs) with shared weights. Compared to PointNet, we remove the spatial transformer blocks and adapt the dimension of the output layer to the regression target. For each input point with class information, a feature vector is learned with shared weights. These feature vectors are max-pooled to create a global representation of the input point cloud. Finally, we use a three-layer regression MLP on top of this global feature to predict the poses.
Figure (b) shows a more detailed diagram of our system. We use two separate networks to handle translation and rotation estimation. The input to the rotation network is a point cloud with $n$ points concatenated with the one-hot encoded class information; in total, the input is an $n \times (3 + k)$ array, where $k$ is the total number of classes. The output of the rotation network is the estimated rotation in the axis-angle representation. The translation network takes normalized point coordinates concatenated with class information and estimates the translation residual. The full translation is obtained by adding back the coordinate mean.

This section describes how the loss functions for 6D pose regression are formulated in our supervised learning framework. Given a set of points on the surface of a known object in the camera coordinate system, the aim of pose estimation is to find the transformation from the object coordinate system to the camera coordinate system. This transformation is composed of a translation and a rotation. A translation consists of the displacements along the three coordinate axes. A rotation specifies the rotation around the three coordinate axes, and it has different representations such as axis-angle, Euler angles, quaternion, and rotation matrix. For supervised learning, suitable loss functions are required to measure the differences between predicted and ground truth poses. For rotation learning, we argue that the axis-angle representation is best suited to the learning task, and geodesic distance is used as the loss function for rotation regression. For translation learning, we predict the translation residual.
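As an illustration of this input layout, the following sketch builds the $n \times (3 + k)$ array described above, plus the mean-centred variant consumed by the translation network. The helper names are hypothetical, not from the released code:

```python
import numpy as np

def make_network_input(points, class_id, n_classes):
    """Concatenate xyz coordinates with a one-hot class vector -> (n, 3 + k)."""
    n = points.shape[0]
    onehot = np.zeros((n, n_classes), dtype=points.dtype)
    onehot[:, class_id] = 1.0
    return np.concatenate([points, onehot], axis=1)

def make_translation_input(points, class_id, n_classes):
    """The translation network sees mean-centred coordinates; the returned
    mean is added back to the predicted residual to recover the translation."""
    mean = points.mean(axis=0)
    return make_network_input(points - mean, class_id, n_classes), mean
```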
In the axis-angle representation, a vector $r = \theta e$ represents a rotation of $\theta$ radians around the unit vector $e$ [9]. Given an axis-angle representation $r$, the corresponding rotation matrix $R$ is obtained via the exponential map

$$R = \exp(\theta [e]_\times) = I_3 + \sin\theta \, [e]_\times + (1 - \cos\theta) \, [e]_\times^2 \qquad (1)$$

where $I_3$ is the $3 \times 3$ identity matrix and $[e]_\times$ is the skew-symmetric matrix

$$[e]_\times = \begin{bmatrix} 0 & -e_3 & e_2 \\ e_3 & 0 & -e_1 \\ -e_2 & e_1 & 0 \end{bmatrix} \qquad (2)$$
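A minimal NumPy sketch of the exponential map (Rodrigues' formula) under the definitions above; the function names are illustrative, and the near-zero-angle guard is our own addition:

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]_x such that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def axis_angle_to_matrix(r, eps=1e-8):
    """Rodrigues' formula: axis-angle vector r = theta * e -> rotation matrix."""
    theta = np.linalg.norm(r)
    if theta < eps:
        return np.eye(3)               # near-zero rotation: identity
    e = r / theta                      # unit rotation axis
    K = skew(e)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```

For example, the vector $(0, 0, \pi/2)$ produces the matrix that rotates the x-axis onto the y-axis.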
For rotation learning, we regress to a predicted rotation $\tilde{r}$. The prediction is compared with the ground truth rotation $r$ via a rotation loss function $l_{rot}$, which is the geodesic distance between $\tilde{r}$ and $r$ [13, 9]:

$$l_{rot}(\tilde{r}, r) = \arccos\!\left(\frac{\operatorname{tr}(\tilde{R} R^\top) - 1}{2}\right) \qquad (3)$$

where $\tilde{R}$ and $R$ are the rotation matrices corresponding to $\tilde{r}$ and $r$, respectively.
This loss function directly measures the magnitude of the rotation difference between the prediction and the ground truth, so it is easy to interpret. Furthermore, the network can make constraint-free predictions with the axis-angle representation, in contrast to, e.g., quaternion representations, which require normalization.
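A sketch of the geodesic distance between two rotation matrices, assuming the standard trace formula $d(R_1, R_2) = \arccos((\operatorname{tr}(R_1 R_2^\top) - 1)/2)$; the clipping of the cosine is our own numerical safeguard:

```python
import numpy as np

def geodesic_distance(R1, R2):
    """Geodesic (angular) distance between two rotation matrices, in radians."""
    cos_angle = (np.trace(R1 @ R2.T) - 1.0) / 2.0
    # clip guards against values slightly outside [-1, 1] from floating-point drift
    return float(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
```

The identity and a 90° rotation about the z-axis are exactly $\pi/2$ radians apart, which matches the intuition that the loss is the magnitude of the relative rotation.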
To simplify the learning task by reducing the variance in the regression space, the learning target for translation is the translation residual. Given a predicted translation residual $\tilde{t}_{res}$, the full translation prediction $\tilde{t}$ is obtained via

$$\tilde{t} = \tilde{t}_{res} + \bar{x} \qquad (4)$$

where $\bar{x}$ is the mean of the input points. A vector norm is used to measure the distance between the prediction $\tilde{t}$ and the ground truth $t$, resulting in the translation loss function $l_{tran}$:

$$l_{tran}(\tilde{t}, t) = \lVert \tilde{t} - t \rVert \qquad (5)$$
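The residual scheme and translation loss can be sketched as follows; the Euclidean norm is used here as one illustrative choice of norm, and the function names are ours:

```python
import numpy as np

def translation_from_residual(residual, points):
    """Recover the full translation by adding back the segment's coordinate mean."""
    return residual + points.mean(axis=0)

def translation_loss(t_pred, t_true):
    """Vector-norm distance between predicted and ground-truth translation."""
    return float(np.linalg.norm(t_pred - t_true))
```

Because the network only has to predict a small residual around the segment centroid, the regression targets stay in a narrow range regardless of where the object sits in the camera frame.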
The total loss is defined as the weighted combination of the translation and rotation losses:

$$l = l_{tran} + \alpha \, l_{rot} \qquad (6)$$

where $\alpha$ is a scaling factor. The total loss is used for training the pose estimation networks.
We evaluate the proposed system on the YCB-Video dataset [32] and compare its performance with the state-of-the-art methods PoseCNN [32] and DenseFusion [31]. We also compare performance on a subset of the object classes with DOPE [30], a state-of-the-art RGB-based method. Besides prediction accuracy and performance under occlusion, we also investigate the impact of different network structures and of different rotation representations. The implementation of our system is available online (https://github.com/GeeeG/CloudPose).
The YCB-Video dataset [32] contains 92 video sequences with a total of 133,827 frames of 21 objects selected from the YCB object set [4], annotated with 6D poses. We follow the official train/test split and use 80 video sequences for training. Testing is performed on the key frames chosen from the remaining 12 sequences [32]. The dataset also provides 80,000 frames of synthetic data as an extension to the training set. During training, the Adam optimizer is used. For the total loss, the scaling factor $\alpha$ is given by the ratio between the expected errors of translation and rotation at the end of training [18]. Batch normalization is applied to all layers; no dropout is used. For refinement, we use the point-to-point ICP registration provided by Open3D [34] and refine for 10 iterations, reducing the ICP search radius after each iteration. For a fair comparison, all methods use the object segmentation provided by PoseCNN during testing. We use the average distance (AD) of model points and the average distance for a rotationally symmetric object (ADS) proposed in [11]
as evaluation metrics. Given a 3D model represented as a set $\mathcal{M}$ with $m$ points, ground truth rotation $R$ and translation $t$, as well as estimated rotation $\tilde{R}$ and translation $\tilde{t}$, the AD is defined as:

$$\mathrm{AD} = \frac{1}{m} \sum_{x \in \mathcal{M}} \lVert (Rx + t) - (\tilde{R}x + \tilde{t}) \rVert \qquad (7)$$
ADS is computed using closest point distance. It provides a distance measure that considers possible pose ambiguities caused by rotational symmetry:
$$\mathrm{ADS} = \frac{1}{m} \sum_{x_1 \in \mathcal{M}} \min_{x_2 \in \mathcal{M}} \lVert (Rx_1 + t) - (\tilde{R}x_2 + \tilde{t}) \rVert \qquad (8)$$
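Both metrics can be sketched directly from the definitions above. The $O(m^2)$ pairwise-distance version of ADS below is an illustrative implementation, not the paper's evaluation code:

```python
import numpy as np

def average_distance(model, R, t, R_hat, t_hat):
    """AD: mean distance between corresponding model points under the two poses."""
    gt = model @ R.T + t
    pred = model @ R_hat.T + t_hat
    return float(np.mean(np.linalg.norm(gt - pred, axis=1)))

def average_distance_symmetric(model, R, t, R_hat, t_hat):
    """AD-S: mean closest-point distance, tolerant to rotational symmetry."""
    gt = model @ R.T + t
    pred = model @ R_hat.T + t_hat
    # pairwise distance matrix (m x m); for each gt point take the nearest predicted point
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return float(np.mean(d.min(axis=1)))
```

For a model that is symmetric under a 90° rotation about z, a pose estimate off by exactly that rotation has a large AD but an ADS of zero, which is why ADS is the appropriate metric for symmetric objects.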
A 6D pose estimate is considered to be correct if AD and ADS are smaller than a given threshold. We report the area under the error-threshold/accuracy curve (AUC) for AD and ADS. The maximum threshold for both curves is set to 0.1 m. Furthermore, we also report ADS accuracy with a threshold of 0.01 m (1 cm) to illustrate performance under a smaller error tolerance.
Evaluation results averaged over all 21 objects in the YCB-Video dataset are shown in Table I. PoseCNN [32] uses RGB information to provide an initial pose estimate (PC w/o ICP), then uses depth information with a highly customized ICP for pose refinement (PC). DenseFusion [31] (DF) uses both color and point cloud features extracted by deep networks to give per-pixel pose estimates for final pose voting, and iterative pose refinement is performed with an extra network module. Ours w/o ICP is the estimated pose from the proposed system architecture (Section III), and Ours is the result after ICP refinement. We also perform ICP refinement on the DF results (DF+ICP). For the overall performance in Table I, we highlight the best performance in bold font. Details regarding the data type used by the pose regression networks and the post-processing are also presented.
Our method achieves state-of-the-art performance using only depth information. In terms of AD, we outperform both PC and DF. We observe that DF+ICP shows only a small improvement over DF. One possible reason is the sensitivity of ICP to the initial pose guess: if a method already performs well without refinement, ICP is able to provide further gains, but if the initial guess is poor, ICP can even make the results worse. This result indicates that features learned from depth information represented by unordered point clouds are sufficient for accurately regressing 6D poses. Furthermore, it shows that the proposed approach is an efficient way to use depth information in a deep learning framework for pose regression.
Performance for individual objects is shown in Table II. We use the trained network for six objects provided by the authors of DOPE [30] and report the results. The AD results for DOPE are not available because the object coordinate frames used in the YCB object dataset [4], the YCB-Video dataset for PoseCNN [32], and DOPE are different. As our method uses the frames from [32], and the transformation between [4] and [32] is not publicly available, we cannot establish the point correspondences required for AD. We also applied ICP to the DOPE pose estimates, but performance did not improve; a possible reason is the sensitivity of ICP to the initial pose estimate.
Some qualitative results are shown in Figure 17. Pose estimates from PC, DF and our method are used for projecting object models onto 2D images. More qualitative results are available in the supplementary video.
| Method | RGB | Depth | ICP | AD | ADS | <1 cm |
|---|---|---|---|---|---|---|
| PC w/o ICP [32] | ✓ | | | 51.5 | 75.6 | 26.1 |
| PC [32] | ✓ | ✓ | ✓ | 77.8 | 93.6 | 88.4 |
| DF [31] | ✓ | ✓ | | 74.7 | 93.9 | 87.6 |
| DF [31] + ICP* | ✓ | ✓ | ✓ | 76.3 | **94.7** | 89.0 |
| Ours w/o ICP | | ✓ | | 76.0 | 91.3 | 80.9 |
| Ours | | ✓ | ✓ | **82.7** | **94.7** | **90.3** |

*We apply ICP to the DF results after their iterative refinement.
| Object | DOPE [30] ADS | DOPE [30] <1cm | PC [32] AD | PC [32] ADS | PC [32] <1cm | DF [31] AD | DF [31] ADS | DF [31] <1cm | Ours AD | Ours ADS | Ours <1cm |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 02_master_chef | — | — | 68.1 | 95.8 | 99.5 | 73.2 | 96.4 | 100 | 46.9 | 95.4 | 95.4 |
| 03_cracker_box | 62.7 | 29.6 | 83.4 | 92.7 | 84.8 | 94.2 | 95.8 | 97 | 76.7 | 93 | 80.4 |
| 04_sugar_box | 85.0 | 33.4 | 97.1 | 98.2 | 100 | 96.5 | 97.6 | 100 | 97.5 | 98.5 | 99.7 |
| 05_tomato_soup | 88.5 | 74.5 | 83.6 | 96.6 | 99 | 87.4 | 96.6 | 99.1 | 72.7 | 96.5 | 96.8 |
| 06_mustard_bottle | 90.7 | 65.3 | 98 | 98.6 | 98.9 | 94.8 | 97.3 | 97.8 | 79.2 | 97.7 | 94.1 |
| 07_tuna_fish_can | — | — | 83.9 | 97.1 | 97.6 | 81.8 | 97.1 | 99.5 | 72 | 97.7 | 100 |
| 08_pudding_box | — | — | 96.6 | 97.9 | 100 | 93.2 | 95.9 | 98.6 | 94.4 | 97.3 | 91.1 |
| 09_gelatin_box | 84.6 | 36.9 | 98.1 | 98.8 | 100 | 96.7 | 98 | 100 | 98.6 | 99 | 100 |
| 10_potted_meat | 32.0 | 3.7 | 86 | 94.3 | 87.5 | 87.8 | 95 | 92 | 90.6 | 95.7 | 93.7 |
| 11_banana | — | — | 91.9 | 97.1 | 95 | 83.6 | 96.2 | 98.2 | 95.1 | 97.7 | 95.5 |
| 19_pitcher_base | — | — | 96.9 | 97.8 | 99.6 | 96.6 | 97.5 | 99.5 | 96.1 | 97.9 | 100 |
| 21_bleach_clean | — | — | 92.5 | 96.9 | 95.1 | 89.7 | 95.8 | 99.4 | 95.4 | 97.4 | 98.4 |
| 24_bowl | — | — | 14.4 | 81 | 42.9 | 5.9 | 89.5 | 55.7 | 83.9 | 97.7 | 99.3 |
| 25_mug | — | — | 81.1 | 94.9 | 97.6 | 88.8 | 96.7 | 98 | 93.9 | 97.8 | 99.7 |
| 35_power_drill | — | — | 97.7 | 98.2 | 99.3 | 93 | 96.1 | 97.8 | 94.9 | 97.7 | 96.7 |
| 36_wood_block | — | — | 70.9 | 87.6 | 74.4 | 30.9 | 92.8 | 88.8 | 90 | 94.9 | 97.5 |
| 37_scissors | — | — | 78.4 | 91.7 | 68 | 77.4 | 91.9 | 71.3 | 75.8 | 91.3 | 63 |
| 40_large_marker | — | — | 85.3 | 97.2 | 97.1 | 93 | 97.6 | 100 | 92.2 | 98 | 100 |
| 51_large_clamp | — | — | 52.2 | 75.3 | 67.4 | 26.4 | 72.6 | 33.3 | 68.5 | 77.4 | 69.6 |
| 52_e_large_clamp | — | — | 25.9 | 74.9 | 48.2 | 16.6 | 77.4 | 10.9 | 25.3 | 66.4 | 22 |
| 61_foam_brick | — | — | 48.1 | 97.2 | 99.7 | 59 | 92 | 100 | 92.9 | 98 | 99.3 |
For a given target object in a frame, the occlusion factor $O$ of the object is defined as [8]

$$O = 1 - \frac{p}{\hat{p}} \qquad (9)$$

where $p$ is the number of pixels in the 2D ground truth segmentation, and $\hat{p}$ is the number of pixels in the projection of the 3D object model onto the image plane using the camera intrinsic parameters and the ground truth 6D pose, assuming the object is fully visible. We divide the range of occlusion factors in the YCB-Video dataset into bins and report the prediction accuracy (ADS) with a threshold of 1 cm per bin. Figure 18 illustrates the results. Our method (Ours) has competitive performance at lower occlusion levels; as the amount of occlusion increases, both our method and PC start to suffer. One possible reason is that DF outputs per-pixel predictions with confidence scores, while ours and PC provide only one pose prediction; this per-pixel prediction may help provide better performance under heavier occlusion.
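Assuming the occlusion factor is the complement of the ratio of visible to fully-projected pixels, it can be computed from two binary masks as in this sketch (function and argument names are ours):

```python
import numpy as np

def occlusion_factor(visible_mask, full_projection_mask):
    """Fraction of the fully-visible model projection that is hidden in the frame.

    visible_mask: boolean mask of the 2D ground truth segmentation.
    full_projection_mask: boolean mask of the model projected with the
    ground truth pose, as if the object were fully visible.
    """
    p = np.count_nonzero(visible_mask)            # visible segment pixels
    p_hat = np.count_nonzero(full_projection_mask)  # pixels if fully visible
    return 1.0 - p / p_hat
```

An object whose ground-truth segment covers half of its full projection thus gets an occlusion factor of 0.5, and a fully visible object gets 0.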
To investigate whether translation and rotation should be regressed with the same or separate networks, we compare the performance of different architectures, altering the network by incrementally sharing layers between the translation and rotation networks. Table III shows the results in terms of AD, ADS, and the accuracies of translation and rotation under given thresholds. "None" denotes the proposed architecture, which regresses translation and rotation with two separate networks; the numbers in the first column denote the number of layers shared between the translation and rotation BaseNets (Figure 4). We compare performance without ICP refinement. Sharing layers consistently performs worse than using two separate networks.
We also tested an architecture that shares all layers but has the same number of parameters as the proposed structure, obtained by doubling the layer width. Its performance is similar to that of the single-width all-shared architecture, which verifies that the performance deterioration is not caused by insufficient network capacity. This result confirms that using separate networks for translation and rotation is the more suitable design choice.
| Shared layers | AD | ADS | rot_err < 10° | tran_err < 1 cm |
|---|---|---|---|---|
| none | 76.0 | 91.3 | 41.3 | 73.0 |
| 1 | 75.2 | 91.1 | 39.5 | 72.0 |
| 2 | 75.5 | 91.1 | 38.6 | 69.8 |
| 3 | 75.1 | 91.1 | 41.2 | 69.6 |
| 4 | 75.5 | 91.2 | 39.4 | 70.3 |
| 5 | 75.2 | 91.1 | 39.1 | 69.7 |
| all | 63.0 | 87.8 | 23.3 | 69.3 |
We investigate the impact of different rotation representations and loss functions. For comparing quaternions to axis-angle, we adapt our rotation network to have a four-dimensional output instead of a three-dimensional one. The output is normalized and then converted to the axis-angle representation, and we use the same loss function as described in Section IV. For comparing the L2 loss with the geodesic distance, we keep the rotation representation in axis-angle format and apply the different loss functions. Table IV shows the accuracy of rotation prediction at different thresholds. With the same loss function, axis-angle yields better results than quaternions, indicating that axis-angle is the better choice for rotation learning. With the same rotation representation, the L2 loss slightly underperforms the geodesic loss; since the geodesic distance also has a better mathematical justification, this makes it the better choice.
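The quaternion baseline described above requires normalising the 4D network output and converting it to axis-angle. A minimal sketch, assuming a w-first quaternion convention (the names and convention are ours, not necessarily those of the released code):

```python
import numpy as np

def quaternion_to_axis_angle(q, eps=1e-8):
    """Convert a (w, x, y, z) quaternion to an axis-angle vector r = theta * e.

    The raw network output is normalised first, mirroring the unit-norm
    constraint on quaternions discussed in the text."""
    q = np.asarray(q, dtype=float)
    q = q / np.linalg.norm(q)          # enforce the unit-norm constraint
    w, v = q[0], q[1:]
    theta = 2.0 * np.arccos(np.clip(w, -1.0, 1.0))  # rotation angle
    s = np.linalg.norm(v)
    if s < eps:
        return np.zeros(3)             # (near-)identity rotation
    return theta * (v / s)             # theta times the unit axis
```

Note how the conversion itself illustrates the paper's argument: the quaternion path needs an explicit normalisation step, whereas an axis-angle output can be used directly.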
We measure time performance on an NVIDIA Titan X GPU. The system is implemented with TensorFlow. Pose estimation by a forward pass through our networks takes 0.11 seconds for a single object. The 10 iterations of ICP refinement require an additional 0.3 seconds.
We propose a system for fast and accurate 6D pose estimation of known objects. We formulate the task as a supervised learning problem, use two separate networks for rotation and translation regression, and use point clouds as the input for regression. We use axis-angle as the rotation representation and geodesic distance as the loss function for rotation regression. Ablation studies show that these design choices outperform the commonly used quaternion representation and L2 loss. Experimental results show that the proposed system outperforms two state-of-the-art methods on a public benchmark.
To the best of our knowledge, this is the first deep learning system that regresses 6D object poses from only depth information represented by unordered point clouds. Features extracted from point clouds with deep networks can be used to accurately regress object poses. Our pose regression system can be applied to range data from other sensors such as laser range finders. In future work, we will investigate aspects such as pose estimation for rotationally symmetric objects using only geometric information.