1 Introduction
The 6D pose of an object is composed of 3D location and 3D orientation. The pose describes the transformation from a local coordinate system of the object to a reference coordinate system (e.g., the camera or robot coordinate system) [20], as shown in Figure 1. Knowing the accurate 6D pose of an object is necessary for robotic applications such as dexterous grasping and manipulation. This problem is challenging due to occlusion, clutter and varying lighting conditions.
Many methods for pose estimation using only color information have been proposed [17, 25, 32, 21]. Since depth cameras are commonly available, many methods using both color and depth information have also been developed [1, 18, 15], including many recent CNN-based methods [18, 15]. In general, methods that use depth information can handle both textured and textureless objects, and they are more robust to occlusion compared to methods using only color information [18, 15].
The 6D pose of an object is an inherently continuous quantity. Some works discretize the continuous pose space [8, 9] and formulate the problem as classification. Others avoid discretization by representing the pose using, e.g., quaternions [34], or the axis-angle representation [22, 4]. Work outside the domain of pose estimation has also considered rotation matrices [24], or, in a more general case, parametric representations of affine transformations [14]. In these cases the problem is often formulated as regression. The choice of rotation representation has a major impact on the performance of the estimation method.
In this work, we propose a deep learning based pose estimation method that uses point clouds as an input. To the best of our knowledge, this is the first attempt at applying deep learning for directly estimating 3D rotation using point cloud segments. We formulate the problem of estimating the rotation of a rigid object as regression from a point cloud segment to the axis-angle representation of the rotation. This representation is constraint-free and thus well-suited for application in supervised learning.
Our experimental results show that our method reaches state-of-the-art performance. We also show that our method exceeds the state-of-the-art in pose estimation tasks with moderate amounts of occlusion. Our approach does not require any post-processing, such as pose refinement by the iterative closest point (ICP) algorithm [3]. In practice, we adapt PointNet [24] for the rotation regression task. Our input is a point cloud with spatial and color information. We use the geodesic distance between rotations as the loss function.
The remainder of the paper is organized as follows. Section 2 reviews related work in pose estimation. In Section 3, we argue why the axis-angle representation is suitable for supervised learning. We present our system architecture and network details in Section 4. Section 5 presents our experimental results. In Section 6 we provide concluding remarks and discuss future work.
2 Related work
6D pose estimation using only RGB information has been widely studied [17, 25, 32, 21]. Since this work concentrates on using point cloud inputs, which contain depth information, we mainly review works that also consider depth information. We also review how depth information can be represented.
2.1 Pose estimation
RGB-D methods. A template matching method which integrates color and depth information is proposed by Hinterstoisser et al. [8, 9]. Templates are built with quantized image gradients on the object contour from RGB information and surface normals on the object interior from depth information, and are annotated with viewpoint information. The effectiveness of template matching is also shown in [12, 19]. However, template matching methods are sensitive to occlusions [18].
Voting-based methods attempt to infer the pose of an object by accumulating evidence from local or global features of image patches. One example is the Latent-Class Hough Forest [31, 30], which adapts the template feature from [8] for generating training data. During the inference stage, a random set of patches is sampled from the input image. The patches are used in Hough voting to obtain pose hypotheses for verification.
3D object coordinates and object instance probabilities are learned using a Decision Forest in [1]. The 6D pose estimation is then formulated as an energy optimization problem which compares synthetic images rendered with the estimated pose against observed depth values. 3D object coordinates are also used in [18, 23]. However, those approaches tend to be very computationally intensive due to the generation and verification of hypotheses [18]. Most recent approaches rely on convolutional neural networks (CNNs). In [20], the work in [1] is extended by adding a CNN to describe the posterior density of an object pose. A combination of a CNN for object segmentation and geometry-based pose estimation is proposed in [16]. PoseCNN [34] uses a similar two-stage network, in which the first stage extracts feature maps from RGB input and the second stage uses the generated maps for object segmentation, 3D translation estimation, and 3D rotation regression in quaternion format. Depth data and ICP are used for pose refinement. Jafari et al. [15] propose a three-stage, instance-aware approach for 6D object pose estimation. An instance segmentation network is first applied, followed by an encoder-decoder network which estimates the 3D object coordinates for each segment. The 6D pose is recovered with a geometric pose optimization step similar to [1]. The approaches [20, 15, 34] do not directly use a CNN to predict the pose. Instead, they provide segmentation and other intermediate information, which are used to infer the object pose.
Point cloud-based methods. Drost et al. [5] propose to extract a global model description from oriented point pair features. With the global description, scene data are matched with models using a voting scheme. This approach is further improved by [10] to be more robust against sensor noise and background clutter. Compared to [5, 10], our approach uses a CNN to learn the global description.
2.2 Depth representation
Depth information in deep learning systems can be represented with, e.g., voxel grids [28, 26], truncated signed distance functions (TSDF) [29], or point clouds [24]. Voxel grids are simple to generate and use. Because of their regular grid structure, voxel grids can be directly used as inputs to 3D CNNs. However, voxel grids are inefficient since they also have to explicitly represent empty space. They also suffer from discretization artifacts. TSDF tries to alleviate these problems by storing in each voxel the shortest distance to the represented surface. This allows a more faithful representation of the 3D information. In comparison to other depth data representations, a point cloud has a simple representation without redundancy, yet contains rich geometric information. Recently, PointNet [24] has made it possible to use raw point clouds directly as the input of a CNN.
3 Supervised learning for rotation regression
The aim of object pose estimation is to find the translation and rotation that describe the transformation from the object coordinate system to the camera coordinate system (Figure 1). The translation consists of the displacements along the three coordinate axes, and the rotation specifies the rotation around the three coordinate axes. Here we concentrate on the problem of estimating rotation.
For supervised learning, we require a loss function that measures the difference between the predicted rotation and the ground truth rotation. To find a suitable loss function, we begin by considering a suitable representation for a rotation. We argue that the axis-angle representation is best suited for a learning task. We then review the connection of the axis-angle representation to the Lie algebra of rotation matrices. The Lie algebra provides us with the tools needed to define our loss function as the geodesic distance of rotation matrices. These steps allow our network to directly make predictions in the axis-angle format.
Notation. In the following, we denote by $(\cdot)^\top$ the vector or matrix transpose. By $\|\cdot\|_2$, we denote the Euclidean or 2-norm. We write $I_3$ for the 3-by-3 identity matrix.
3.1 Axis-angle representation of rotations
A rotation can be represented, e.g., as Euler angles, a rotation matrix, a quaternion, or with the axis-angle representation. Euler angles are known to suffer from gimbal lock discontinuity [11]. Rotation matrices and quaternions have orthogonality and unit norm constraints, respectively. Such constraints may be problematic in an optimization-based approach such as supervised learning, since they restrict the range of valid predictions. To avoid these issues, we adopt the axis-angle representation. In the axis-angle representation, a vector $\mathbf{r} = \theta\mathbf{v}$ represents a rotation of $\theta$ radians around the unit vector $\mathbf{v}$ [7].
3.2 The Lie group $SO(3)$
The special orthogonal group $SO(3)$ is a compact Lie group that contains the 3-by-3 orthogonal matrices with determinant one, i.e., all rotation matrices [6]. Associated with $SO(3)$ is the Lie algebra $\mathfrak{so}(3)$, consisting of the set of skew-symmetric 3-by-3 matrices.
Let $\mathbf{r} = \theta\mathbf{v}$ be an axis-angle representation of a rotation. The corresponding element of $\mathfrak{so}(3)$ is the skew-symmetric matrix
$$\mathbf{r}_\times = \begin{bmatrix} 0 & -r_3 & r_2 \\ r_3 & 0 & -r_1 \\ -r_2 & r_1 & 0 \end{bmatrix}. \quad (1)$$
The exponential map $\exp \colon \mathfrak{so}(3) \to SO(3)$ connects the Lie algebra with the Lie group by
$$\exp(\mathbf{r}_\times) = I_3 + \frac{\sin\theta}{\theta}\,\mathbf{r}_\times + \frac{1-\cos\theta}{\theta^2}\,\mathbf{r}_\times^2, \quad (2)$$
where $\theta = \|\mathbf{r}\|_2$ as above. (In a practical implementation, the Taylor expansions of $\sin\theta/\theta$ and $(1-\cos\theta)/\theta^2$ should be used for small $\theta$ for numerical stability.)
Now let $R$ be a rotation matrix in the Lie group $SO(3)$. The logarithmic map connects $R$ with an element in the Lie algebra by
$$\log(R) = \frac{\theta(R)}{2\sin(\theta(R))}\,(R - R^\top), \quad (3)$$
where
$$\theta(R) = \arccos\!\left(\frac{\operatorname{trace}(R) - 1}{2}\right) \quad (4)$$
can be interpreted as the magnitude of rotation related to $R$ in radians. If desired, we can now obtain an axis-angle representation of $R$ by first extracting from $\log(R)$ the corresponding elements indicated in Eq. (1), and then setting the norm of the resulting vector to $\theta(R)$.
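The maps in Eqs. (1)–(4) are straightforward to implement. The following is a minimal NumPy sketch; the function names and the small-angle threshold are our own illustrative choices, not from the paper:

```python
import numpy as np

def hat(r):
    """Eq. (1): map an axis-angle vector r to its skew-symmetric matrix."""
    return np.array([[0.0, -r[2], r[1]],
                     [r[2], 0.0, -r[0]],
                     [-r[1], r[0], 0.0]])

def exp_map(r, eps=1e-8):
    """Eq. (2): axis-angle vector -> rotation matrix (Rodrigues' formula)."""
    theta = np.linalg.norm(r)
    K = hat(r)
    if theta < eps:
        # Small-angle Taylor expansion for numerical stability (see footnote)
        return np.eye(3) + K + 0.5 * (K @ K)
    return (np.eye(3)
            + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * (K @ K))

def log_map(R, eps=1e-8):
    """Eqs. (3)-(4): rotation matrix -> axis-angle vector."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < eps:
        return np.zeros(3)
    K = (theta / (2.0 * np.sin(theta))) * (R - R.T)
    # Extract the vector components as indicated in Eq. (1)
    return np.array([K[2, 1], K[0, 2], K[1, 0]])
```

Round-tripping `log_map(exp_map(r))` recovers `r` for rotation angles below $\pi$, which is a convenient sanity check for any implementation.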
3.3 Loss function for rotation regression
We regress to a predicted rotation $\hat{\mathbf{r}}$ represented in the axis-angle form. The prediction is compared against the ground truth rotation $\mathbf{r}$ via a loss function $L(\hat{\mathbf{r}}, \mathbf{r})$. Let $\hat{R}$ and $R$ denote the two rotation matrices corresponding to $\hat{\mathbf{r}}$ and $\mathbf{r}$, respectively. We use as loss function the geodesic distance $d(\hat{R}, R)$ of $\hat{R}$ and $R$ [13, 7], i.e.,
$$L(\hat{\mathbf{r}}, \mathbf{r}) = d(\hat{R}, R) = \arccos\!\left(\frac{\operatorname{trace}(\hat{R}R^\top) - 1}{2}\right), \quad (5)$$
where we first obtain $\hat{R}$ and $R$ via the exponential map, and then calculate Eq. (5) to obtain the loss value. This loss function directly measures the magnitude of rotation between $\hat{R}$ and $R$, making it convenient to interpret. Furthermore, using the axis-angle representation allows making predictions free of constraints such as the unit norm requirement of quaternions. This makes the loss function also convenient to implement in a supervised learning approach.
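As a sketch, Eq. (5) can be evaluated as below; the function names are our own, and a real training pipeline would use a differentiable framework rather than plain NumPy:

```python
import numpy as np

def rotmat(r):
    """Axis-angle vector -> rotation matrix via the exponential map, Eq. (2)."""
    theta = np.linalg.norm(r)
    if theta < 1e-8:
        return np.eye(3)
    K = np.array([[0.0, -r[2], r[1]],
                  [r[2], 0.0, -r[0]],
                  [-r[1], r[0], 0.0]])
    return (np.eye(3) + (np.sin(theta) / theta) * K
            + ((1.0 - np.cos(theta)) / theta**2) * (K @ K))

def geodesic_loss(r_pred, r_true):
    """Eq. (5): angle of the relative rotation between prediction and ground truth."""
    R_rel = rotmat(r_pred) @ rotmat(r_true).T
    # Clip guards against numerical drift outside arccos's domain [-1, 1]
    return np.arccos(np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0))
```

The loss is zero exactly when the two rotations coincide, and otherwise equals the residual rotation angle in radians.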
4 System architecture
Figure 2 shows the system overview. We train our system for a specific target object, in Figure 2 the drill. The inputs to our system are the RGB color image, the depth image, and a segmentation mask indicating which pixels belong to the target object. We first create a point cloud segment of the target object based on these inputs. Each point has 6 dimensions: 3 dimensions for spatial coordinates and 3 dimensions for color information. We randomly sample a fixed number $n$ of points from this point cloud segment to create a fixed-size downsampled point cloud; the same $n$ is used in all of our experiments. We then remove the estimated translation from the point coordinates to normalize the data. The normalized point cloud segment is then fed into a network which outputs a rotation prediction in the axis-angle format. During training, we use the ground truth segmentation and translation. As we focus on rotation estimation, during testing we apply the segmentation and translation outputs of PoseCNN [34].
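The preprocessing steps above can be sketched as follows. The helper name and the default sample size `n=256` are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def make_segment(points_xyz, colors_rgb, mask, translation, n=256):
    """Build a normalized, fixed-size 6D point cloud segment for the network.

    points_xyz: (N, 3) spatial coordinates; colors_rgb: (N, 3) colors;
    mask: (N,) boolean segmentation of the target object;
    translation: (3,) estimated object translation; n: sample size (illustrative).
    """
    seg_xyz = points_xyz[mask]
    seg_rgb = colors_rgb[mask]
    # Sampling with replacement always yields a fixed-size input,
    # even when the segment has fewer than n points
    idx = np.random.choice(len(seg_xyz), size=n, replace=True)
    xyz = seg_xyz[idx] - translation  # remove estimated translation to normalize
    return np.concatenate([xyz, seg_rgb[idx]], axis=1)  # shape (n, 6)
```

At test time the same routine would be fed the segmentation mask and translation predicted by PoseCNN instead of the ground truth.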
We consider two variants for our network presented in the following subsections. The first variant processes the point cloud as a set of independent points without regard to the local neighbourhoods of points. The second variant explicitly takes into account the local neighbourhoods of a point by considering its nearest neighbours.
4.1 PointNet (PN)
Our PN network is based on PointNet [24], as illustrated in Figure 3. The PointNet architecture is invariant to all possible permutations of the input point cloud, and is hence an ideal structure for processing raw point clouds. The invariance is achieved by processing all points independently using multi-layer perceptrons (MLPs) with shared weights. The obtained feature vectors are finally max-pooled to create a global feature representation of the input point cloud. Finally, we attach a three-layer regression MLP on top of this global feature to predict the rotation.
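The permutation invariance of this design can be illustrated with a toy NumPy forward pass; the weights are random placeholders and the function names are ours, so this is only a structural sketch, not the actual network:

```python
import numpy as np

def shared_mlp(points, weights):
    """Apply the same MLP to every point independently (shared weights)."""
    h = points
    for W, b in weights:
        h = np.maximum(h @ W + b, 0.0)  # ReLU activation
    return h

def pointnet_rotation(points, feat_weights, head_weights):
    """Per-point features -> max-pooled global feature -> regression head."""
    feats = shared_mlp(points, feat_weights)  # (n, d) per-point features
    global_feat = feats.max(axis=0)           # permutation-invariant max pooling
    h = global_feat
    for W, b in head_weights[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = head_weights[-1]                   # final layer: no activation
    return h @ W + b                          # 3-dimensional axis-angle output
```

Because the per-point MLP is shared and the pooling is a symmetric function, shuffling the input points leaves the output unchanged.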
4.2 Dynamic nearest neighbour graph (DG)
In the PN architecture, all features are extracted based only on a single point. Hence, it does not explicitly consider the local neighbourhoods of individual points. However, local neighbourhoods can contain useful geometric information for pose estimation [27]. Local neighbourhoods are considered in an alternative network structure based on the dynamic nearest-neighbour graph network proposed in [33]. For each point $p_i$ in the point set, a $k$-nearest neighbour graph is calculated; the same fixed $k$ is used in all our experiments. The graph contains directed edges $(p_i, p_j)$, such that the $p_j$ are the $k$ closest points to $p_i$. For each edge, an edge feature is calculated. The edge features are then processed in a similar manner as in PointNet to preserve permutation invariance. This dynamic graph convolution can then be repeated, now calculating the nearest-neighbour graph for the feature vectors of the first shared MLP layer, and so on for the subsequent layers. We use the implementation provided by the authors of [33] (https://github.com/WangYueFt/dgcnn), and call the resulting network DG for dynamic graph.
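A sketch of the nearest-neighbour graph and edge feature construction is given below. The concrete edge feature $(x_i, x_j - x_i)$ follows the DGCNN formulation of [33]; `k=4` and the function name are illustrative assumptions:

```python
import numpy as np

def knn_edge_features(points, k=4):
    """For each point, find its k nearest neighbours and build
    DGCNN-style edge features (x_i, x_j - x_i). k is illustrative."""
    # Pairwise squared Euclidean distances, shape (n, n)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude self-edges
    nbrs = np.argsort(d2, axis=1)[:, :k]  # (n, k) neighbour indices
    xi = np.repeat(points[:, None, :], k, axis=1)  # (n, k, d) centre points
    xj = points[nbrs]                              # (n, k, d) neighbours
    return np.concatenate([xi, xj - xi], axis=-1)  # (n, k, 2d) edge features
```

In the dynamic graph network, the same construction is re-applied in feature space after each shared MLP layer, so the graph connectivity changes from layer to layer.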
5 Experimental results
This section shows experimental results of the proposed approach on the YCB video dataset [34], and compares its performance with the state-of-the-art PoseCNN method [34]. Besides prediction accuracy, we investigate the effect of occlusions and of the quality of the segmentation and translation estimates.
5.1 Experiment setup
The YCB video dataset [34] is used for training and testing with the original train/test split. The dataset contains 133,827 frames of 21 objects selected from the YCB object set [2] with 6D pose annotation. 80,000 frames of synthetic data are also provided as an extension to the training set.
We select a set of four objects to test on, shown in Figure 4. As our approach does not consider object symmetry, we use objects that have 1-fold rotational symmetry (power drill, banana and pitcher base) or 2-fold rotational symmetry (extra large clamp).
We run all experiments using both the PointNet-based (PN) and dynamic graph (DG) networks. During training, the Adam optimizer is used with a fixed learning rate and batch size. Batch normalization is applied to all layers. No dropout is used.
For training, ground truth segmentations and translations are used as the corresponding inputs shown in Fig. 2. When evaluating 3D rotation estimation in Subsection 5.3, the translation and segmentation predicted by PoseCNN are used.
We observed that the color information, represented in RGB color space, varies in an inconsistent manner across different video sequences; hence, all of the following experimental results are obtained using only the XYZ coordinate information of the point cloud. Moreover, our current system does not deal with the classification problem: an individual network is trained for each object. Due to the differences in experimental setup between our method and PoseCNN, the performance comparison mainly illustrates the potential of the proposed approach.
5.2 Evaluation metrics
For evaluating rotation estimation, we directly use the geodesic distance described in Section 3 to quantify the rotation error. We evaluate 6D pose estimation using the average distance of model points (ADD) proposed in [9]. For a 3D model represented as a set of points $\mathcal{M}$, with ground truth rotation $R$ and translation $t$, and estimated rotation $\hat{R}$ and translation $\hat{t}$, the ADD is defined as
$$\mathrm{ADD} = \frac{1}{m} \sum_{x \in \mathcal{M}} \left\| (Rx + t) - (\hat{R}x + \hat{t}) \right\|_2, \quad (6)$$
where $m$ is the number of points. The 6D pose estimate is considered to be correct if the ADD is smaller than a given threshold.
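Eq. (6) translates directly into code; only the function name below is our own:

```python
import numpy as np

def add_metric(model_points, R, t, R_hat, t_hat):
    """Average distance of model points (ADD), Eq. (6).

    model_points: (m, 3) points of the 3D model;
    (R, t): ground truth rotation and translation;
    (R_hat, t_hat): estimated rotation and translation.
    """
    gt = model_points @ R.T + t          # points under the ground truth pose
    est = model_points @ R_hat.T + t_hat  # points under the estimated pose
    return np.linalg.norm(gt - est, axis=1).mean()
```

For identical poses the metric is zero, and for a pure translation offset it equals the magnitude of that offset.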
5.3 Rotation estimation
Figure 5 shows the estimation accuracy as a function of the rotation angle error threshold, i.e., the fraction of predictions that have an angle error smaller than the horizontal axis value. Results are shown for PoseCNN, PoseCNN with ICP refinement (PoseCNN+ICP), our method with the PointNet structure (PN), and our method with the dynamic graph structure (DG). To determine the effect of the translation and segmentation input, we additionally test our methods while giving the ground truth translation and segmentation as input. The cases with ground truths provided are indicated by +gt and shown with a dashed line.
The performance without ground truth translation and segmentation is significantly worse than the performance with ground truth information. This shows that good translation and segmentation results are crucial for accurate rotation estimation. Also, even with ground truth information, the performance for the extra large clamp (2-fold rotational symmetry) is worse than for the other objects, which illustrates that object symmetry should be taken into consideration during the learning process.
The results also confirm that ICP-based refinement usually improves the estimation quality only if the initial guess is already good enough. When the initial estimate is not accurate enough, the use of ICP can even decrease the accuracy, as shown by the PoseCNN+ICP curve falling below the PoseCNN curve for large angle thresholds.
Effect of occlusion. We quantify the effect of occlusion on the rotation prediction accuracy. For a given frame and target object, we estimate the occlusion factor $o$ of the object by
$$o = 1 - \frac{\lambda}{\mu}, \quad (7)$$
where $\lambda$ is the number of pixels in the 2D ground truth segmentation, and $\mu$ is the number of pixels in the projection of the 3D model of the object onto the image plane, using the camera intrinsic parameters and the ground truth 6D pose, under the assumption that the object is fully visible. We noted that $o$ is mostly below 0.5 for the test frames of the YCB-video dataset. We categorize frames with smaller $o$ as low occlusion and frames with larger $o$ as moderate occlusion.
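For binary masks, Eq. (7) is a one-line computation; the function and argument names are ours:

```python
import numpy as np

def occlusion_factor(gt_mask, full_mask):
    """Eq. (7): o = 1 - lambda/mu.

    gt_mask: boolean ground truth segmentation (visible pixels, lambda);
    full_mask: boolean projection of the full, unoccluded model (mu pixels).
    """
    lam = int(gt_mask.sum())  # visible object pixels
    mu = int(full_mask.sum())  # pixels if the object were fully visible
    return 1.0 - lam / mu
```

An unoccluded object yields $o = 0$, and an object with half of its projection hidden yields $o = 0.5$.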
Object        | Banana        |               | Power Drill   |               | Extra Large Clamp |
Occlusion     | low           | mod           | low           | mod           | low           | mod
PoseCNN [34]  | 62.0° ± 3.1°  | 8.2° ± 0.25°  | 14.7° ± 0.3°  | 37.4° ± 2.4°  | 109.8° ± 2.0° | 151.0° ± 3.6°
PoseCNN+ICP   | 56.5° ± 3.4°  | 7.1° ± 0.9°   | 6.9° ± 0.4°   | 44.1° ± 3.5°  | 115.5° ± 2.0° | 140.5° ± 6.0°
Ours (PN)     | 93.3° ± 2.2°  | 107.4° ± 1.5° | 65.1° ± 1.3°  | 115.5° ± 1.4° | 138.4° ± 4.3° | –
Ours (DG)     | 82° ± 2.5°    | 130.4° ± 1.5° | 51.3° ± 1.2°  | 130.5° ± 4.1° | 145.7° ± 1.7° | 134.2° ± 3.1°
Ours (PN+gt)  | 9.9° ± 0.5°   | 6.5° ± 0.3°   | 13° ± 0.8°    | –             | –             | –
Ours (DG+gt)  | 9.8° ± 1.2°   | 34.1° ± 1.6°  | 68.2° ± 8.9°  | –             | –             | –
Table 1 shows the average rotation angle error (in degrees) and its confidence interval for PoseCNN and our method in the low and moderate occlusion categories. (The results for the pitcher base are not reported, since all samples in the testing set for the pitcher base have low occlusion.) We also investigated the effect of the translation and segmentation by considering variants of our methods that were provided with the ground truth translation and segmentation. These variants are indicated in the table by +gt.
We observe that with ground truth information, our method shows potential in cases of both low and moderate occlusion. Furthermore, with the dynamic graph architecture (DG), the average error tends to be lower for objects with 1-fold rotational symmetry. This shows that the local neighbourhood information extracted by DG is useful for rotation estimation when there is no pose ambiguity. One observation is that for the banana, the rotation error of PoseCNN in the low occlusion case is significantly higher than in the moderate occlusion case. This is because a large share of the low occlusion test frames exhibit rotation errors within a narrow, high-error range.
Qualitative results for rotation estimation are shown in Figure 6. The leftmost column denotes the occlusion factor of the target object. Then, from left to right, we show the ground truth, PoseCNN+ICP, our method using DG, and our method using DG with ground truth translation and segmentation (DG+gt). In all cases, the ground truth pose, or respectively the pose estimate, is indicated by the green overlay on the figures. To focus on the difference in the rotation estimate, we use the ground truth translation for all methods in the visualization. The rotation predictions for Ours (DG) are still based on the translation and segmentation from PoseCNN.
The first two rows of Figure 6 show cases with moderate occlusion. When the discriminative part of the banana is occluded (top row), PoseCNN cannot recover the rotation, while our method still produces a good estimate. The situation is similar in the second row for the drill. The third row illustrates that the quality of segmentation has a strong impact on the accuracy of rotation estimation. In this case the segmentation fails to detect the black clamp on the black background, which leads to a poor rotation estimate for both PoseCNN and our method. When we provide the ground truth segmentation (third row, last column), our method is still unable to recover the correct rotation due to the pose ambiguity.
6 Conclusion
We propose to directly predict the 3D rotation of a known rigid object from a point cloud segment. We use the axis-angle representation of rotations as the regression target. Our network learns a global representation either from individual input points, or from point sets of nearest neighbours. The geodesic distance is used as the loss function to supervise the learning process. Without using ICP refinement, experiments show that the proposed method can reach competitive and sometimes superior performance compared to PoseCNN.
Our results show that point cloud segments contain enough information for inferring object pose. The axisangle representation does not have any constraints, making it a suitable regression target. Using Lie algebra as a tool provides a valid distance measure for rotations. This distance measure can be used as a loss function during training.
We discovered that the performance of our method is strongly affected by the quality of the target object translation and segmentation, which will be further investigated in future work. We will extend the proposed method to full 6D pose estimation by additionally predicting the object translations. We also plan to integrate object classification into our system, and study a wider range of target objects.
Acknowledgments
This work was partially funded by the German Science Foundation (DFG) in project Crossmodal Learning, TRR 169.
References
 [1] Brachmann, E., Krull, A., Michel, F., Gumhold, S., Shotton, J., Rother, C.: Learning 6d object pose estimation using 3d object coordinates. In: ECCV (2014)
 [2] Calli, B., Walsman, A., Singh, A., Srinivasa, S., Abbeel, P.: Benchmarking in manipulation research using the yale-cmu-berkeley object and model set. Robotics & Automation Magazine, IEEE 22(3), 36–52 (2015)
 [3] Chen, Y., Medioni, G.: Object modelling by registration of multiple range images. Image and vision computing 10(3), 145–155 (1992)
 [4] Do, T., Cai, M., Pham, T., Reid, I.: Deep-6DPose: Recovering 6D Object Pose from a Single RGB Image. arXiv preprint arXiv:1802.10367 (2018)
 [5] Drost, B., Ulrich, M., Navab, N., Ilic, S.: Model globally, match locally: Efficient and robust 3d object recognition. In: CVPR (2010)
 [6] Hall, B.: Lie groups, Lie algebras, and representations: an elementary introduction. Springer (2015)

 [7] Hartley, R., Trumpf, J., Dai, Y., Li, H.: Rotation averaging. International Journal of Computer Vision 103(3), 267–305 (2013)
 [8] Hinterstoisser, S., Holzer, S., Cagniart, C., Ilic, S., Konolige, K., Navab, N., Lepetit, V.: Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In: ICCV (2011)
 [9] Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., Navab, N.: Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In: ACCV (2012)
 [10] Hinterstoisser, S., Lepetit, V., Rajkumar, N., Konolige, K.: Going further with point pair features. In: ECCV (2016)
 [11] Hoag, D.: Apollo guidance and navigation: Considerations of apollo imu gimbal lock. Cambridge: MIT Instrumentation Laboratory pp. 1–64 (1963)
 [12] Hodaň, T., Zabulis, X., Lourakis, M., Obdržálek, S., Matas, J.: Detection and fine 3d pose estimation of texture-less objects in rgb-d images. In: IROS (2015)
 [13] Huynh, D.Q.: Metrics for 3d rotations: Comparison and analysis. Journal of Mathematical Imaging and Vision 35(2), 155–164 (2009)

 [14] Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: NIPS (2015)
 [15] Jafari, O.H., Mustikovela, S.K., Pertsch, K., Brachmann, E., Rother, C.: iPose: Instance-Aware 6D Pose Estimation of Partly Occluded Objects. arXiv preprint arXiv:1712.01924 (2018)
 [16] Jafari, O.H., Mustikovela, S.K., Pertsch, K., Brachmann, E., Rother, C.: The best of both worlds: Learning geometrybased 6d object pose estimation. arXiv preprint arXiv:1712.01924 (2017)
 [17] Kehl, W., Manhardt, F., Tombari, F., Ilic, S., Navab, N.: SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again. In: ICCV (2017)
 [18] Kehl, W., Milletari, F., Tombari, F., Ilic, S., Navab, N.: Deep learning of local rgbd patches for 3d object detection and 6d pose estimation. In: ECCV (2016)
 [19] Kehl, W., Tombari, F., Navab, N., Ilic, S., Lepetit, V.: Hashmod: A Hashing Method for Scalable 3D Object Detection. In: BMVC (2015)
 [20] Krull, A., Brachmann, E., Michel, F., Yang, M.Y., Gumhold, S., Rother, C.: Learning analysisbysynthesis for 6d pose estimation in rgbd images. In: ICCV (2015)
 [21] Li, Y., Wang, G., Ji, X., Xiang, Y., Fox, D.: DeepIM: Deep Iterative Matching for 6D Pose Estimation. arXiv preprint arXiv:1804.00175 (2018)
 [22] Mahendran, S., Ali, H., Vidal, R.: 3d pose regression using convolutional neural networks. In: ICCV (2017)
 [23] Michel, F., Kirillov, A., Brachmann, E., Krull, A., Gumhold, S., Savchynskyy, B., Rother, C.: Global hypothesis generation for 6d object pose estimation. In: CVPR (2017)
 [24] Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: CVPR (2017)
 [25] Rad, M., Lepetit, V.: Bb8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In: ICCV (2017)
 [26] Riegler, G., Ulusoy, A.O., Geiger, A.: Octnet: Learning deep 3d representations at high resolutions. In: CVPR (2017)
 [27] Rusu, R., Bradski, G., Thibaux, R., Hsu, J.: Fast 3d recognition and pose using the viewpoint feature histogram. In: IROS (2010)
 [28] Sedaghat, N., Zolfaghari, M., Amiri, E., Brox, T.: Orientationboosted voxel nets for 3d object recognition. In: BMVC (2017)
 [29] Song, S., Xiao, J.: Deep sliding shapes for amodal 3d object detection in rgbd images. In: CVPR (2016)
 [30] Tejani, A., Kouskouridas, R., Doumanoglou, A., Tang, D., Kim, T.: Latent-class hough forests for 6 dof object pose estimation. PAMI 40(1), 119–132 (2018)
 [31] Tejani, A., Tang, D., Kouskouridas, R., Kim, T.: Latent-class hough forests for 3d object detection and pose estimation. In: ECCV (2014)
 [32] Tekin, B., Sinha, S.N., Fua, P.: Realtime seamless single shot 6d object pose prediction. In: CVPR (2018)
 [33] Wang, Y., Sun, Y., Liu, Z., Sarma, S.E., Bronstein, M.M., Solomon, J.M.: Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829 (2018)
 [34] Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. In: RSS (2018)