6D pose estimation, which aims to predict the 3D rotation and translation from object space to camera space, is useful in 3D object detection and recognition [12, 15] and in robot grasping and manipulation tasks [24, 22]. However, it remains challenging, as both accuracy and efficiency are required for real-world applications.
Existing methods can be divided into RGB-only methods and RGB-D based methods. Methods taking only RGB images as input use deep neural networks either to regress the 6D pose directly [28, 11, 13] or to detect the 2D projections of 3D key points and then recover the 6D pose by solving a Perspective-n-Point (PnP) problem [18, 16, 19, 5, 23, 24, 21]. Although these methods achieve fast inference and handle occlusion to some extent, there are still large accuracy gaps compared with RGB-D based methods, as the depth map provides effective complementary information about the object's geometry. Most recent RGB-D based methods predict a coarse 6D pose and then use the depth map to refine the estimate with the iterative closest point (ICP) algorithm [6, 7, 2, 9, 14, 29]. However, ICP is time-consuming and sensitive to initialization.
To overcome these problems, DenseFusion  proposes an RGB-D based deep neural network that considers visual appearance and geometric structure simultaneously, is robust to occlusion, and achieves real-time inference. However, it does not model the correlation within and between the two modalities, and thus cannot fully exploit their consistent and complementary information to learn discriminative features for object pose estimation.
In this paper, we propose a novel Correlation Fusion (CF) framework that models the feature correlation within and between the RGB and depth modalities to improve 6D pose estimation. We propose two modules, Intra-modality Correlation Modeling and Inter-modality Correlation Modeling, which use self-attention to select prominent features within and across the two modalities. These modules efficiently model global, long-range dependencies, complementing the local convolution operation. Furthermore, we explore different strategies for fusing the intra- and inter-modality information to ensure efficient and effective information flow within and between modalities. Experiments show that pose estimation accuracy can be further improved with the proposed fusion strategies.
The main contributions of our work can be summarized into four parts. Firstly, we propose intra- and inter-modality correlation modules to exploit the consistent and complementary information within and between the image and depth modalities for 6D pose estimation. Secondly, we explore different strategies for effectively fusing the intra- and inter-modality information flows, which are crucial for discriminative multi-modal feature learning. Thirdly, we demonstrate that our method achieves state-of-the-art performance on widely used benchmark datasets for 6D pose estimation, including the LineMOD  and YCB-Video  datasets. Lastly, we showcase its efficacy in a real robot task, where the robot grasps objects using the estimated object poses.
II Related Works
Pose from RGB images. 6D object pose estimation from a single RGB image has been intensively studied in recent years. Existing methods either perform regression from detection, like PoseCNN , or predict the 2D projections of predefined 3D key points [18, 16, 19, 5, 23, 24, 21]. The first kind of method can handle low-texture and partially occluded objects; however, its predictions are sensitive to small errors due to the large search space. The keypoint-based methods help address occlusion, but have difficulty with truncated objects, as some of the key points may fall outside the input image. Moreover, the aforementioned methods do not utilize depth information, and hence may be unable to disambiguate object scales under perspective projection. Our proposed method effectively fuses RGB and depth information for more accurate 6D pose estimation.
Pose from RGB-D images. The performance of 6D pose estimation can be further improved by incorporating depth information. Current RGB-D based approaches utilize depth information mainly in three ways. First, RGB and depth information are used in separate stages [6, 28, 7]: a coarse 6D pose is predicted from the RGB image, followed by ICP-based refinement using the depth information. Second, the RGB and depth modalities are fused at an early stage [2, 9, 14], where the depth map is treated as an extra channel and concatenated with the RGB channels. However, these methods fail to utilize the correlation between the two modalities, and their refinement stage is time-consuming, so they cannot achieve real-time inference. Recently, DenseFusion  explored fusing the RGB and depth modalities at a late stage, achieving state-of-the-art performance at almost real-time speed. Instead of direct feature concatenation as in DenseFusion, we exploit the consistent and complementary information between the two modalities by modeling the intra- and inter-modality correlation with an attention mechanism. Furthermore, we explore different fusion strategies to make the information flow within the framework more efficient.
Attention mechanisms have been integrated into many deep learning-based computer vision and language processing tasks, such as detection, classification  and visual question answering (VQA) . There are many variants of attention mechanisms, among which self-attention  has attracted particular interest due to its ability to model long-range dependencies while maintaining computational efficiency. Motivated by this, we integrate intra- and inter-modality correlation modeling with self-attention modules for efficient fusion of RGB and depth information in 6D pose estimation. We also explore different strategies for more efficient multi-modal feature fusion. To the best of our knowledge, this is the first work to explore efficient fusion of RGB and depth information for 6D object pose estimation with an attention mechanism. We show that our method enables efficient exploitation of the context information from both the RGB and depth modalities, achieving state-of-the-art accuracy in 6D pose estimation as well as satisfactory robot grasping performance.
Given an RGB-D image and 3D model of known objects, we aim to predict the 6D object pose which is represented as a transformation in 3D space.
Estimating the 6D object pose from an RGB image is challenging due to low texture, heavy occlusion and varying lighting conditions. Depth information provides extra geometric cues that help resolve these problems. However, RGB and depth reside in two different domains. Thus, efficient fusion schemes that preserve the modality-specific information as well as the complementary information from the other modality are necessary for accurate pose estimation.
Figure 2 illustrates the proposed architecture for solving the aforementioned challenges. The first stage performs semantic segmentation and feature extraction. The second stage, our main focus, models the intra- and inter-modality correlation within and between the RGB and depth modalities, followed by different strategies for fusing these modules.
Finally, an additional stage applies an iterative refinement procedure to obtain the final 6D pose estimate. We explain the detailed architecture in the following subsections.
III-A Semantic Segmentation and Feature Extraction
Firstly, we segment the target objects in the image with the existing semantic segmentation architecture presented in PoseCNN , given its efficiency and performance. Specifically, given an image, the segmentation network generates a per-pixel segmentation map classifying each image pixel into an object class. The RGB and depth images are then cropped with the bounding box of the predicted object.
Secondly, the cropped RGB and depth images are processed separately to compute color and geometric features. For the depth image, the segmented depth pixels are first converted into 3D points using the known camera intrinsics. The points are then fed to a PointNet  variant (PNet) to obtain per-point geometric features. The cropped RGB image is passed through a CNN-based encoder-decoder architecture to produce pixel-wise color features.
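The depth-to-point conversion above follows the standard pinhole camera model. A minimal, stdlib-only sketch; the intrinsics values below are illustrative, not from the paper:

```python
def depth_to_points(depth_pixels, fx, fy, cx, cy):
    """Back-project depth pixels (u, v, z) into camera-space 3D points
    with the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    points = []
    for u, v, z in depth_pixels:
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points.append((x, y, z))
    return points

# Illustrative intrinsics (focal lengths fx, fy and principal point cx, cy).
pts = depth_to_points([(320.0, 240.0, 1.0), (420.0, 240.0, 0.5)],
                      fx=500.0, fy=500.0, cx=320.0, cy=240.0)
# A pixel at the principal point back-projects onto the optical axis: (0, 0, z).
```

The resulting point set is what a PointNet-style encoder would consume, one feature vector per point.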
III-B Multi-modality Correlation Learning
Our proposed Multi-modality Correlation Learning (MMCL) module contains Intra- and Inter-modality Correlation Modelling modules, where the former aims to extract modality-specific features, while the latter extracts modality-complementary features.
(a) Intra-modality Correlation Modelling (IntraMCM)
The IntraMCM module is proposed to extract modality-specific discriminative features. The implementation of IntraMCM is illustrated in Figure 3.
Firstly, within each modality, the features are transformed into query, key and value features with 1×1 convolutions $W_q$, $W_k$ and $W_v$:
$$Q_c = W_q^c F_c, \quad K_c = W_k^c F_c, \quad V_c = W_v^c F_c,$$
$$Q_g = W_q^g F_g, \quad K_g = W_k^g F_g, \quad V_g = W_v^g F_g,$$
where $Q_c$, $K_c$, $V_c$ are the transformed color features, $Q_g$, $K_g$, $V_g$ are the transformed geometric features, the $W$ matrices are learned weight parameters, and $d$ denotes the common dimension of the transformed features from both modalities.
Then, the raw attention weights $A_c$ and $A_g$ for the RGB and depth modalities are obtained by computing the inner products $Q_c K_c^\top$ and $Q_g K_g^\top$ with a row-wise softmax:
$$A_c = \mathrm{softmax}(Q_c K_c^\top), \quad A_g = \mathrm{softmax}(Q_g K_g^\top).$$
Then $A_c$ and $A_g$ are used to weight the information flow within the RGB modality and the depth modality respectively:
$$\hat{F}_c = A_c V_c, \quad \hat{F}_g = A_g V_g,$$
where $\hat{F}_c$ and $\hat{F}_g$ denote the updated color and geometric feature maps, respectively.
Then element-wise addition is applied to combine $\hat{F}_c$ with the original color feature map $F_c$, weighted by a learnable parameter $\gamma_c$, to obtain the final feature map; the same process is applied to the geometric features:
$$F_c' = \gamma_c \hat{F}_c + F_c, \quad F_g' = \gamma_g \hat{F}_g + F_g.$$
This allows the model to learn from local information first and gradually assign more weight to the non-local information; $\gamma_c$ and $\gamma_g$ are initialized to 0, following the self-attention GAN .
Therefore, the output of the IntraMCM module captures both local and non-local color-to-color and geometry-to-geometry relations, maintaining prominent modality-specific features for the subsequent pose estimation.
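The IntraMCM update can be sketched with plain Python on toy matrices. The learned 1×1 convolutions are replaced here by identity projections, so this only illustrates the softmax(QKᵀ)V computation and the γ-weighted residual:

```python
import math

def matmul(A, B):
    """Naive product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def row_softmax(M):
    """Row-wise softmax, normalizing the raw attention weights."""
    out = []
    for row in M:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def intra_attention(Q, K, V, F, gamma):
    """IntraMCM-style update within one modality:
    A = softmax(Q K^T); F_hat = A V; output = gamma * F_hat + F."""
    KT = [list(col) for col in zip(*K)]
    A = row_softmax(matmul(Q, KT))
    F_hat = matmul(A, V)
    return [[gamma * h + f for h, f in zip(hr, fr)] for hr, fr in zip(F_hat, F)]

# Toy 2-pixel, 2-dim features, with identity projections standing in for
# the learned 1x1 convolutions.
F = [[1.0, 0.0], [0.0, 1.0]]
out0 = intra_attention(F, F, F, F, gamma=0.0)  # gamma = 0: pure local features
out1 = intra_attention(F, F, F, F, gamma=1.0)  # non-local information mixed in
```

With γ initialized to 0 the module starts as an identity mapping, which is exactly why the model "learns from local information first".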
(b) Inter-Modality Correlation Modelling (InterMCM)
The implementation of InterMCM is illustrated in Figure 3; it learns to extract modality-complementary features. The module first generates two sets of attention maps $A_c$ and $A_g$ in the same way as the IntraMCM module. Then, the generated attention maps are used to weight the features from the other modality, yielding the updated feature maps $\tilde{F}_c$ and $\tilde{F}_g$:
$$\tilde{F}_c = A_c V_g, \quad \tilde{F}_g = A_g V_c.$$
Then the color and geometric feature maps are further updated in the same way as in the IntraMCM module:
$$F_c' = \gamma_c' \tilde{F}_c + F_c, \quad F_g' = \gamma_g' \tilde{F}_g + F_g,$$
where $\gamma_c'$ and $\gamma_g'$ are learnable parameters initialized to 0, as in the IntraMCM module.
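A toy sketch of the cross-modal weighting step, under the assumption that each modality's attention map is applied to the other modality's value features ($A_c$ on $V_g$); the matrices below are illustrative:

```python
def inter_update(A_own, V_other, F_own, gamma):
    """InterMCM-style update: this modality's attention map weights the other
    modality's value features, then a gamma-weighted residual is added."""
    F_tilde = [[sum(a * v for a, v in zip(arow, vcol)) for vcol in zip(*V_other)]
               for arow in A_own]
    return [[gamma * t + f for t, f in zip(tr, fr)] for tr, fr in zip(F_tilde, F_own)]

# Toy example: a color attention map (rows sum to 1) applied to geometric values.
A_c = [[0.5, 0.5], [0.25, 0.75]]
V_g = [[2.0, 0.0], [0.0, 2.0]]
F_c = [[1.0, 1.0], [1.0, 1.0]]
out = inter_update(A_c, V_g, F_c, gamma=1.0)
```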
III-C Multi-modality Fusion Strategies
In our work, we explore different strategies to effectively combine the two information flows, as illustrated in Figure 2; the effectiveness of each updating scheme is elaborated in Section IV.
Parallel update: The IntraMCM and InterMCM modules are applied simultaneously; we term this Fuse_V1.
Sequential update: The IntraMCM and InterMCM modules are applied sequentially. Fuse_V2 first performs InterMCM and then IntraMCM, while Fuse_V3 applies them in the reverse order.
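The three updating schemes can be sketched abstractly; `intra` and `inter` below are placeholders that merely record the order in which the real modules would be applied:

```python
def intra(fc, fg):
    # Placeholder for the IntraMCM module: records the update order.
    return fc + "->intra", fg + "->intra"

def inter(fc, fg):
    # Placeholder for the InterMCM module.
    return fc + "->inter", fg + "->inter"

def fuse(fc, fg, strategy):
    """Apply the two correlation modules under the three strategies above."""
    if strategy == "V1":  # parallel: both modules see the same inputs
        fc1, fg1 = intra(fc, fg)
        fc2, fg2 = inter(fc, fg)
        return (fc1, fc2), (fg1, fg2)  # branch outputs combined downstream
    if strategy == "V2":  # sequential: InterMCM, then IntraMCM
        return intra(*inter(fc, fg))
    if strategy == "V3":  # sequential: IntraMCM, then InterMCM
        return inter(*intra(fc, fg))

v2 = fuse("c", "g", "V2")  # -> ("c->inter->intra", "g->inter->intra")
```

The structural difference is visible immediately: in the sequential variants each module sees the other's output, while in the parallel variant both see the raw features.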
III-D Pose Estimation and Refinement
Dense pose prediction. After the color and geometric features are fused, we predict the object's 6D pose in a pixel-wise dense manner, with a confidence score indicating how likely each prediction is to be the true object pose. The dense prediction makes our algorithm more robust to occlusion and segmentation errors. During inference, the predicted pose with the highest confidence is selected as the final prediction.
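The confidence-based selection at inference time amounts to an argmax over the per-pixel predictions; a minimal sketch (the pose values are placeholders):

```python
def select_pose(predictions):
    """Return the pose whose per-pixel confidence is highest.
    Each prediction is a (pose, confidence) pair."""
    return max(predictions, key=lambda pc: pc[1])[0]

# Placeholder poses; in the real network each would be a rotation/translation.
preds = [("pose_a", 0.31), ("pose_b", 0.92), ("pose_c", 0.77)]
best = select_pose(preds)  # -> "pose_b"
```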
Iterative pose refinement. We adopt a refiner network module as in DenseFusion  for iterative pose refinement. We integrate the correlation modelling module into the pose refiner network in the same fashion as in the main network in Figure 2. Specifically, at each iteration, we perform pixel-wise fusion of the original color features and the geometric features transformed by the previously predicted pose, and then feed the fused pixel-wise features to the pose refiner network, which outputs a residual pose relative to the previous prediction. After $K$ iterations, the final pose estimate is:
$$\hat{p} = [R_K \mid t_K] \cdot [R_{K-1} \mid t_{K-1}] \cdots [R_0 \mid t_0].$$
In principle, the pose refiner network can be trained jointly with the main network, but for efficiency we start training the refiner only after the main network converges.
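The iterative refinement composes per-iteration residual poses onto the initial estimate. A stdlib-only sketch with 4×4 homogeneous transforms, using translation-only poses for illustration:

```python
def mat4_mul(A, B):
    """Compose two 4x4 homogeneous transforms (row-major lists of lists)."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def compose_refinements(residuals, initial):
    """Left-multiply each residual pose onto the running estimate:
    final = T_K ... T_1 * T_0, where T_0 is the initial prediction."""
    pose = initial
    for T in residuals:
        pose = mat4_mul(T, pose)
    return pose

def trans(x, y, z):
    """Translation-only homogeneous transform, for illustration."""
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]

final = compose_refinements([trans(0.01, 0.0, 0.0), trans(0.0, 0.02, 0.0)],
                            trans(1.0, 0.0, 0.0))
# Translations accumulate: the final translation is (1.01, 0.02, 0.0).
```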
[Table header: Methods — PoseCNN | DenseFusion | OURS (IntraMCM) | OURS (InterMCM) | OURS (Fuse_V1) | OURS (Fuse_V2) | OURS (Fuse_V3)]
IV-A Datasets and Metrics
We compare our method with the state-of-the-art methods on two commonly used datasets, YCB-Video  and LineMOD . Pose estimation performance is evaluated using (1) the average distance (ADD) metric  and (2) the average closest point distance (ADD-S) metric .
The ADD metric is obtained by first transforming the model points with the predicted pose $(\tilde{R}, \tilde{t})$ and the ground-truth pose $(R, t)$, respectively, and then computing the mean of the pairwise distances between the two sets of transformed points:
$$\mathrm{ADD} = \frac{1}{m} \sum_{x \in \mathcal{M}} \left\| (Rx + t) - (\tilde{R}x + \tilde{t}) \right\|,$$
where $\mathcal{M}$ denotes the set of 3D model points and $m$ is the number of points in the set.
The ADD-S metric  is proposed for symmetric objects, where the matching between point sets is ambiguous for some views. ADD-S is defined as:
$$\mathrm{ADD\text{-}S} = \frac{1}{m} \sum_{x_1 \in \mathcal{M}} \min_{x_2 \in \mathcal{M}} \left\| (Rx_1 + t) - (\tilde{R}x_2 + \tilde{t}) \right\|.$$
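Both metrics follow directly from their definitions; a stdlib-only sketch on toy point sets, where a 180° rotation of a symmetric point set shows why ADD-S is preferred for symmetric objects:

```python
import math

def transform(points, R, t):
    """Apply rotation R (3x3) and translation t to each 3D point."""
    return [tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))
            for p in points]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def add_metric(model_points, pred_pose, gt_pose):
    """ADD: mean distance between corresponding model points under the
    predicted and ground-truth poses."""
    P = transform(model_points, *pred_pose)
    G = transform(model_points, *gt_pose)
    return sum(dist(p, g) for p, g in zip(P, G)) / len(model_points)

def adds_metric(model_points, pred_pose, gt_pose):
    """ADD-S: for each ground-truth point, distance to the *closest* predicted
    point, which tolerates ambiguous correspondences on symmetric objects."""
    P = transform(model_points, *pred_pose)
    G = transform(model_points, *gt_pose)
    return sum(min(dist(g, p) for p in P) for g in G) / len(model_points)

I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
Rz = [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]  # 180-degree rotation about z

# A point set symmetric under Rz: the rotated prediction is "wrong" for ADD
# but perfectly aligned for ADD-S.
sym = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
zero = (0.0, 0.0, 0.0)
add_v = add_metric(sym, (Rz, zero), (I3, zero))    # large error under ADD
adds_v = adds_metric(sym, (Rz, zero), (I3, zero))  # zero error under ADD-S
```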
IV-B Implementation Details
We implement our method in the PyTorch framework. All parameters, except where specified, are initialized with the PyTorch defaults. Our model is trained using the Adam optimizer  with an initial learning rate of 1e-4. After the loss of the estimator network falls below 0.016, a decay of 0.3 is applied to the learning rate and the refiner network is further trained. The mini-batch sizes for the estimator and refiner networks are set separately.
IV-C Experiment Analysis
IV-C1 Ablation Study
In this section, we first perform an ablation study to verify the necessity of each component in our framework, including the IntraMCM, InterMCM and correlation fusion modules, on both the YCB-Video (Table I) and LineMOD (Table II) datasets. From the tables, one can observe that using either IntraMCM or InterMCM alone already improves performance, as they capture discriminative intra- and inter-modality features.
Besides the two correlation modelling modules, we also explore different schemes for effectively fusing the information flow within and between the two modalities. According to the order of information passing, we design three fusion strategies: Fuse_V1, Fuse_V2 and Fuse_V3 (introduced in Section III-C). On both datasets, Fuse_V1 performs slightly worse than the Intra-only and Inter-only methods; we conjecture this is caused by over-fitting. Meanwhile, Fuse_V2 and Fuse_V3 outperform the Intra-only, Inter-only and parallel methods, which indicates that sequential updating is a better way to handle feature fusion, while the specific update order has less influence on prediction performance.
IV-C2 Comparison with the State of the Art
We also compare our method with the state-of-the-art methods that take RGB-D images as input and output 6D object poses, on the YCB-Video and LineMOD datasets.
Results on YCB-Video dataset. The results in terms of the ADD(-S) AUC and ADD(-S) <2cm metrics are presented in Table I. For both metrics, our method is superior to the state-of-the-art methods [28, 26]. In particular, our method outperforms PoseCNN  by a margin of 13.21% and DenseFusion  by 4.71% in terms of the ADD(-S) <2cm metric.
Results on LineMOD dataset. Table II summarizes the comparison with [7, 21, 26] in terms of the ADD(-S) metric on the LineMOD dataset. SSD-6D  and BB8  obtain an initial 6D pose estimate from the RGB image and then use the depth image for pose refinement, while DenseFusion  uses RGB and depth images for both pose estimation and pose refinement. Compared with these methods, which ignore the correlation between the RGB and depth modalities, our proposed method achieves the best performance.
IV-D Efficiency and Qualitative Results
The average running time of our full model is 49.8 ms per frame, including 23.6 ms for the semantic segmentation forward pass, 17.3 ms for the pose estimation forward pass, and 8.9 ms for the refiner forward pass, on a single Nvidia RTX 2080 Ti GPU. Thus, our method can run in real time on a GPU at around 20 fps.
In Figure 4, we present qualitative results on the YCB-Video dataset from both DenseFusion  and our proposed method. Our method is more accurate under heavy occlusion, as shown by the potted meat can in the first column (from left to right). Moreover, our method generates more accurate predictions for symmetric objects, such as the large clamp and foam brick in the second and third columns respectively. In the last column, we show that both methods fail to predict the 6D pose of the bowl, a symmetric object under heavy occlusion.
IV-E Robotic Grasping Experiments
We carry out robotic grasping experiments in both simulation and the real world to demonstrate that our algorithm is effective for robot grasping tasks.
Grasping in simulation. We compare the proposed method with DenseFusion  in the Gazebo simulation environment. We retrain both models with data collected from the environment. We place four objects from the YCB-Video dataset at five random locations and four random orientations on the table. The robot arm aligns its gripper with the predicted object pose to grasp the target object. The robot arm makes 20 attempts to grasp each object, for 80 grasps in total per method. The results are shown in Table III. Thanks to the correlation fusion framework, our method has a significantly higher pick-up success rate than DenseFusion.
[Table header: Success Attempts (%) — tomato_soup_can | mustard_bottle | banana | bleach_cleanser]
Grasping in the real world. We also apply our algorithm to a real-world robot task, where the robot arm picks up objects from a table. Without further fine-tuning on real test data, our model predicts object poses accurate enough for the grasping task. More visualization results are presented in the submitted video.
In this paper, we proposed a novel Correlation Fusion framework with intra- and inter-modality correlation learning for 6D object pose estimation. The IntraMCM module helps to learn prominent modality-specific features, and the InterMCM module helps to capture modality-complementary features. Different fusion schemes are then explored to further improve 6D pose estimation. Extensive experiments on the YCB-Video and LineMOD datasets and a real robot grasping task demonstrate the superior performance of our method.
This research is supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project A18A2b0046).
-  Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086. Cited by: §II.
-  (2014) Learning 6d object pose estimation using 3d object coordinates. In European conference on computer vision, pp. 536–551. Cited by: §I, §II.
-  (2011) Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In 2011 international conference on computer vision, pp. 858–865. Cited by: §I, §IV-A.
-  (2012) Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian conference on computer vision, pp. 548–562. Cited by: §IV-A.
-  (2018) Segmentation-driven 6d object pose estimation. arXiv preprint arXiv:1812.02541. Cited by: §I, §II.
-  (2018) iPose: instance-aware 6d pose estimation of partly occluded objects. In Asian Conference on Computer Vision, pp. 477–492. Cited by: §I, §II.
-  (2017) SSD-6d: making rgb-based 3d detection and 6d pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1521–1529. Cited by: §I, §II, §IV-C2, TABLE II.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-B.
-  (2015) Learning analysis-by-synthesis for 6d pose estimation in rgb-d images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 954–962. Cited by: §I, §II.
-  (2019) Zoom out-and-in network with map attention decision for region proposal and object detection. International Journal of Computer Vision 127 (3), pp. 225–238. Cited by: §II.
-  (2018) Deepim: deep iterative matching for 6d pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 683–698. Cited by: §I.
-  (2019) Recurrent convolutional fusion for RGB-D object recognition. IEEE Robotics and Automation Letters 4 (3), pp. 2878–2885. Cited by: §I.
-  (2018) Deep model-based 6d pose refinement in rgb. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 800–815. Cited by: §I.
-  (2017) Global hypothesis generation for 6d object pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 462–471. Cited by: §I, §II.
-  (2017) 3D bounding box estimation using deep learning and geometry. In CVPR, pp. 5632–5640. Cited by: §I.
-  (2018) Making deep heatmaps robust to partial occlusions for 3d object pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 119–134. Cited by: §I, §II.
-  (2017) Automatic differentiation in pytorch. Cited by: §IV-B.
-  (2017) 6-dof object pose from semantic keypoints. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2011–2018. Cited by: §I, §II.
-  (2018) PVNet: pixel-wise voting network for 6dof pose estimation. arXiv preprint arXiv:1812.11788. Cited by: §I, §II.
-  (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: §III-A.
-  (2017) BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3828–3836. Cited by: §I, §II, §IV-C2, TABLE II.
-  (2016) A dataset for improved rgbd-based object detection and pose estimation for warehouse pick-and-place. IEEE Robotics and Automation Letters 1 (2), pp. 1179–1185. Cited by: §I.
-  (2018) Real-time seamless single shot 6d object pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 292–301. Cited by: §I, §II.
-  (2018) Deep object pose estimation for semantic robotic grasping of household objects. In Conference on Robot Learning, pp. 306–316. Cited by: §I, §I, §II.
-  (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §II.
-  (2019) DenseFusion: 6d object pose estimation by iterative dense fusion. Cited by: §I, §II, §III-D, TABLE I, §IV-C2, §IV-C2, §IV-D, §IV-E, TABLE II, TABLE III.
-  (2017) Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. Cited by: §II.
-  PoseCNN: a convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199. Cited by: §I, §I, §II, §II, §III-A, TABLE I, §IV-A, §IV-A, §IV-C2.
-  (2017) Multi-view self-supervised deep learning for 6d pose estimation in the amazon picking challenge. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1383–1386. Cited by: §I.
-  (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §III.
-  Discriminative multi-modal feature fusion for rgbd indoor scene recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2969–2976. Cited by: §I.