6D Pose Estimation with Correlation Fusion

09/24/2019 ∙ by Yi Cheng, et al. ∙ Agency for Science, Technology and Research

6D object pose estimation is widely applied in robotic tasks such as grasping and manipulation. Prior methods using RGB-only images are vulnerable to heavy occlusion and poor illumination, so it is important to complement them with depth information. However, existing methods using RGB-D data do not adequately exploit the consistent and complementary information between the two modalities. In this paper, we present a novel method that models the correlation within and across the RGB and depth modalities with an attention mechanism to learn discriminative multi-modal features. Effective fusion strategies for the intra- and inter-correlation modules are then explored to ensure efficient information flow between RGB and depth. To the best of our knowledge, this is the first work to explore effective intra- and inter-modality fusion in 6D pose estimation. Experimental results show that our method achieves state-of-the-art performance on the LineMOD and YCB-Video datasets and benefits robot grasping tasks.




I Introduction

6D pose estimation, which aims to predict the 3D rotation and translation from object space to camera space, is useful in 3D object detection and recognition [12, 15] and in robot grasping and manipulation tasks [24, 22]. However, it remains challenging, as real-world applications require both accuracy and efficiency.

Existing methods can be divided into RGB-only and RGB-D based methods. Methods with only RGB images as input use deep neural networks either to regress the 6D pose directly [28, 11, 13] or to detect the 2D projections of 3D key points and then obtain the 6D pose by solving a Perspective-n-Point (PnP) problem [18, 16, 19, 5, 23, 24, 21]. Although these methods achieve fast inference and address occlusion to some extent, a large gap remains compared with RGB-D based methods, as the depth map provides effective complementary information about object geometry [31]. Most recent RGB-D based methods predict a coarse 6D pose and then use the depth map to refine the estimate with the iterative closest point (ICP) algorithm [6, 7, 2, 9, 14, 29]. However, ICP is time-consuming and sensitive to initialization.

Fig. 1: We develop an end-to-end deep network for 6D pose estimation that fuses RGB and depth features via correlation learning, yielding fast and accurate predictions for real-time applications such as robot grasping and manipulation.

To overcome these problems, DenseFusion [26] proposes an RGB-D based deep neural network that considers visual appearance and geometric structure simultaneously, is robust to occlusion, and achieves real-time inference speed. However, it does not model the correlation within and between the two modalities, and therefore does not fully exploit their consistent and complementary information to learn discriminative features for object pose estimation.

In this paper, we propose a novel Correlation Fusion (CF) framework that models the feature correlation within and between the RGB and depth modalities to improve 6D pose estimation. We propose two modules, Intra-modality Correlation Modelling and Inter-modality Correlation Modelling, to help select prominent features within and across the two modalities via self-attention. These modules efficiently model global and long-range dependencies, complementing the local convolution operations. Furthermore, different strategies for fusing the intra- and inter-modality information are explored to ensure efficient and effective information flow within and between modalities. Experiments show that pose estimation accuracy can be further improved with the proposed fusion strategies.

The main contributions of our work can be summarized in four parts. Firstly, we propose intra- and inter-modality correlation modules to exploit the consistent and complementary information within and between the image and depth modalities for 6D pose estimation. Secondly, we explore different strategies for effectively fusing the intra- and inter-modality information flows, which is crucial for discriminative multi-modal feature learning. Thirdly, we demonstrate that our proposed method achieves state-of-the-art performance on widely used benchmarks for 6D pose estimation, the LineMOD [3] and YCB-Video [28] datasets. Lastly, we showcase its efficacy in a real robot task, where the robot grasps objects using the estimated object poses.

II Related Works

The literature on 3D object detection and pose estimation is vast, so we focus mainly on recent works based on machine learning and deep learning techniques.

Pose from RGB images. 6D object pose estimation from a single RGB image has been intensively studied in recent years. Existing methods either regress pose from detection, like PoseCNN [28], or predict the 2D projections of predefined 3D key points [18, 16, 19, 5, 23, 24, 21]. The first kind of method can handle low-texture and partially occluded objects, but its predictions are sensitive to small errors due to the large search space. Keypoint-based methods help address occlusion, but have difficulty with truncated objects, as some key points may lie outside the input image. Moreover, the aforementioned methods do not utilize depth information, and hence may be unable to disambiguate object scales under perspective projection. Our proposed method effectively fuses RGB and depth information for more accurate 6D pose estimation.

Pose from RGB-D images. The performance of 6D pose estimation can be further improved by incorporating depth information. Current RGB-D based approaches utilize depth information mainly in three ways. First, RGB and depth information are used in separate stages [6, 28, 7]: a coarse 6D pose is predicted from the RGB image and then refined with the ICP algorithm using the depth information. Second, RGB and depth modalities are fused at an early stage [2, 9, 14], where the depth map is treated as an additional channel and concatenated with the RGB channels. However, these methods fail to utilize the correlation between the two modalities, and their refinement stage is time-consuming, so they cannot achieve real-time inference. Recently, [26] explored fusing RGB and depth modalities at a late stage, achieving state-of-the-art performance at near real-time inference speed. Instead of direct feature concatenation as in [26], we exploit the consistent and complementary information between the two modalities by modeling the intra- and inter-modality correlation with an attention mechanism. Furthermore, we explore different fusion strategies to make the information flow within the framework more efficient.

Attention mechanisms. Attention mechanisms have been integrated into many deep learning-based computer vision and language processing tasks, such as detection [10], classification [27] and visual question answering (VQA) [1]. Among the many variants, self-attention [25] has attracted considerable interest due to its ability to model long-range dependencies while maintaining computational efficiency. Motivated by this work, we propose to integrate intra- and inter-modality correlation modelling with a self-attention module for efficient fusion of RGB and depth information in 6D pose estimation. We also explore different strategies for more efficient multi-modal feature fusion. To the best of our knowledge, this is the first work to explore an efficient fusion of RGB and depth information in 6D object pose estimation with an attention mechanism. We show that our proposed method enables efficient exploitation of context information from both the RGB and depth modalities, achieving state-of-the-art accuracy in 6D pose estimation and satisfactory robot grasping performance.

Fig. 2: Overview of the Correlation Fusion (CF) framework for 6D pose estimation. The Multi-modality Correlation Learning (MMCL) module contains the Intra-modality Correlation Modelling (IntraMCM) and Inter-modality Correlation Modelling (InterMCM) modules. We also explore parallel (Fuse-V1) and sequential (Fuse-V2 and Fuse-V3) strategies to combine the two modules. This helps to efficiently model within- and cross-modality dependencies to capture the consistent and complementary information for accurate pose estimation.

III Methodology

Given an RGB-D image and 3D model of known objects, we aim to predict the 6D object pose which is represented as a transformation in 3D space.

Estimating the 6D object pose from an RGB image is challenging due to low texture, heavy occlusion and varying lighting conditions. Depth information provides extra geometric cues to help resolve these problems. However, RGB and depth reside in two different domains. Thus, efficient fusion schemes that keep the modality-specific information as well as the complementary information from the other modality are necessary for accurate pose estimation.

Fig. 2 illustrates the proposed architecture. The first stage performs semantic segmentation and feature extraction. The second stage, our main focus, models the intra- and inter-correlation within and between the RGB and depth modalities, followed by different strategies for fusing these modules. An additional stage applies iterative refinement to obtain the final 6D pose estimate. We explain the detailed architecture in the following subsections.

III-A Semantic Segmentation and Feature Extraction

Firstly, we segment the target objects in the image with the existing semantic segmentation architecture presented in [28], given its efficiency and performance. Specifically, given an image, the segmentation network generates a per-pixel segmentation map classifying each pixel into an object class. The RGB and depth images are then cropped with the bounding box of the predicted object.

Secondly, the cropped RGB and depth images are processed separately to compute color and geometric features. For the depth image, the segmented depth pixels are first converted into 3D points using the given camera intrinsics. The points are then fed to a PointNet [20] variant (PNet) to obtain $d$-dimensional geometric features $F_g$. The cropped RGB image is passed through a CNN-based encoder-decoder architecture to produce $d$-dimensional pixel-wise color features $F_c$.
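As a concrete illustration, the depth-to-point-cloud conversion can be sketched as follows; the intrinsics `fx, fy, cx, cy` and the toy depth map are illustrative stand-ins, not values from the paper:

```python
import numpy as np

def backproject_depth(depth, mask, fx, fy, cx, cy):
    """Convert the segmented depth pixels to 3D points in the camera frame."""
    v, u = np.nonzero(mask)             # pixel coordinates of the segmented object
    z = depth[v, u]                     # depth values (meters)
    x = (u - cx) * z / fx               # pinhole camera model
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # (N, 3) point cloud for the PointNet variant

# Toy example: a 2x2 depth map with every pixel segmented as the object
depth = np.full((2, 2), 0.5)
mask = np.ones((2, 2), dtype=bool)
points = backproject_depth(depth, mask, fx=500.0, fy=500.0, cx=1.0, cy=1.0)
```

The resulting (N, 3) array is the per-object point set that a PointNet-style network consumes point-wise.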

III-B Multi-modality Correlation Learning

Our proposed Multi-modality Correlation Learning (MMCL) module contains the Intra- and Inter-modality Correlation Modelling modules, where the former aims to extract modality-specific features, while the latter helps to extract modality-complementary features.

Fig. 3: Illustration of the proposed Intra- and Inter-modality Correlation Modelling modules. As these modules apply to color and geometric features in a symmetric fashion, only the geometry-centric Intra-modality Correlation Modelling (IntraMCM) and Inter-modality Correlation Modelling (InterMCM) are shown.

(a) Intra-modality Correlation Modelling (IntraMCM)

The IntraMCM module is proposed to extract modality-specific discriminative features. The implementation of IntraMCM is illustrated in Figure 3.

Firstly, within each modality, the features are transformed into query, key and value features with 1×1 convolutions $W_q$, $W_k$ and $W_v$:

$Q_c = W_q^c F_c, \quad K_c = W_k^c F_c, \quad V_c = W_v^c F_c,$
$Q_g = W_q^g F_g, \quad K_g = W_k^g F_g, \quad V_g = W_v^g F_g,$

where $Q_c$, $K_c$, $V_c$ are the transformed color features, $Q_g$, $K_g$ and $V_g$ are the transformed geometric features, the $W$ matrices are learned weight parameters and $\bar{d}$ denotes the common dimension of the transformed features from both modalities.

Then, the raw attention weights $A_c$ and $A_g$ for the RGB and depth modalities are obtained by computing the inner products $Q_c K_c^{\top}$ and $Q_g K_g^{\top}$, followed by a row-wise softmax:

$A_c = \mathrm{softmax}(Q_c K_c^{\top}), \quad A_g = \mathrm{softmax}(Q_g K_g^{\top}).$
Then $A_c$ and $A_g$ are used to weight the information flow within the RGB and depth modalities respectively:

$\hat{F}_c = A_c V_c, \quad \hat{F}_g = A_g V_g,$

where $\hat{F}_c$ and $\hat{F}_g$ denote the updated color and geometric feature maps.

Element-wise addition is then applied to combine $\hat{F}_c$, weighted by a learnable parameter $\gamma_c$, with the original color feature map $F_c$ to obtain the final feature map, and the same process is applied to the geometric features:

$F'_c = \gamma_c \hat{F}_c + F_c, \quad F'_g = \gamma_g \hat{F}_g + F_g.$

This allows the model to learn from local information first and gradually assign more weight to non-local information; $\gamma_c$ and $\gamma_g$ are initialized to 0 as in [30].

Therefore, the output of the IntraMCM module captures both local and non-local color-to-color and geometry-to-geometry relations, thus maintaining prominent modality-specific features for the subsequent pose estimation.
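A minimal NumPy sketch of the IntraMCM update for a single modality may clarify the flow; the 1×1 convolutions reduce to matrix multiplications on per-pixel features, and all weights here are random stand-ins rather than learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def intra_mcm(F, Wq, Wk, Wv, gamma):
    """Self-attention update within one modality: F' = gamma * (A V) + F."""
    Q, K, V = F @ Wq, F @ Wk, F @ Wv   # query/key/value features
    A = softmax(Q @ K.T, axis=-1)      # (N, N) row-wise attention weights
    return gamma * (A @ V) + F         # residual connection with learnable gamma

N, d = 8, 16                           # 8 pixels/points with 16-dim features
F_c = rng.standard_normal((N, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = intra_mcm(F_c, Wq, Wk, Wv, gamma=0.0)  # gamma starts at 0, so out == F_c
```

With gamma initialized to 0 the module starts as an identity map, matching the design of learning from local information first.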

(b) Inter-Modality Correlation Modelling (InterMCM)

The implementation of InterMCM, which learns to extract modality-complementary features, is illustrated in Figure 3. The module first generates two sets of attention maps $A_c$ and $A_g$ in the same way as the IntraMCM module. The generated attention maps are then used to weight the value features from the other modality, yielding the updated feature maps $\tilde{F}_c$ and $\tilde{F}_g$:

$\tilde{F}_c = A_g V_c, \quad \tilde{F}_g = A_c V_g.$
The color and geometric feature maps are then updated in the same way as in the IntraMCM module:

$F'_c = \beta_c \tilde{F}_c + F_c, \quad F'_g = \beta_g \tilde{F}_g + F_g,$

where $\beta_c$ and $\beta_g$ are learnable parameters initialized to 0, as in the IntraMCM module.
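Under the same reconstruction, InterMCM swaps the attention maps across modalities. A sketch with random stand-in weights (one set of query/key/value transforms per modality):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(F, Wq, Wk):
    """Row-wise softmax attention computed within one modality."""
    return softmax((F @ Wq) @ (F @ Wk).T, axis=-1)  # (N, N)

def inter_mcm(F_c, F_g, Wc, Wg, beta_c, beta_g):
    """Cross-modality update: each modality's values are re-weighted by the
    attention map computed from the other modality."""
    A_c = attention_map(F_c, Wc[0], Wc[1])
    A_g = attention_map(F_g, Wg[0], Wg[1])
    V_c, V_g = F_c @ Wc[2], F_g @ Wg[2]
    F_c_new = beta_c * (A_g @ V_c) + F_c   # geometry attention weights color
    F_g_new = beta_g * (A_c @ V_g) + F_g   # color attention weights geometry
    return F_c_new, F_g_new

N, d = 8, 16
F_c, F_g = rng.standard_normal((N, d)), rng.standard_normal((N, d))
Wc = [rng.standard_normal((d, d)) for _ in range(3)]
Wg = [rng.standard_normal((d, d)) for _ in range(3)]
F_c2, F_g2 = inter_mcm(F_c, F_g, Wc, Wg, beta_c=0.0, beta_g=0.0)
```

As with IntraMCM, the zero-initialized beta parameters make the module start as an identity map.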

Iii-C Multi-modality Fusion Strategies

In our work, we explore different fusion strategies to effectively combine the two information flows, as illustrated in Figure 2; the effectiveness of each updating scheme is elaborated in Section IV.

Parallel update: The IntraMCM and InterMCM modules are applied simultaneously; this is termed Fuse_V1.

Sequential update: The IntraMCM and InterMCM modules are applied sequentially. Fuse_V2 first performs InterMCM and then IntraMCM; Fuse_V3 applies them in the reverse order.
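The three strategies amount to different compositions of the two modules. A toy trace (the `intra`/`inter` functions merely record the update order, standing in for the actual modules):

```python
def intra(x):
    """Stand-in for IntraMCM: append its tag to the update trace."""
    return x + ["IntraMCM"]

def inter(x):
    """Stand-in for InterMCM."""
    return x + ["InterMCM"]

def fuse_v1(x):
    """Parallel update: both modules see the same input; outputs are combined."""
    return intra(x) + inter(x)[len(x):]

def fuse_v2(x):
    """Sequential update: InterMCM first, then IntraMCM."""
    return intra(inter(x))

def fuse_v3(x):
    """Sequential update: IntraMCM first, then InterMCM."""
    return inter(intra(x))
```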

III-D Pose Estimation and Refinement

Dense pose prediction. After the color and geometric features are fused, we predict the object's 6D pose in a pixel-wise dense manner, together with a confidence score indicating how likely each prediction is to be the ground-truth object pose. The dense prediction makes our algorithm more robust to occlusion and segmentation errors. During inference, the predicted pose with the highest confidence is selected as the final prediction.
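The selection step itself is just an argmax over the per-pixel confidences; a sketch with made-up shapes (each of N pixels predicts a 7-dim pose, quaternion plus translation, and a scalar confidence):

```python
import numpy as np

def select_pose(poses, confidences):
    """Pick the dense per-pixel prediction with the highest confidence."""
    return poses[int(np.argmax(confidences))]

rng = np.random.default_rng(2)
N = 100
poses = rng.standard_normal((N, 7))  # per-pixel [quaternion | translation]
conf = rng.random(N)                 # per-pixel confidence scores
best = select_pose(poses, conf)      # final 6D pose prediction
```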

Iterative pose refinement. We adopt a refiner network as in [26] for iterative pose refinement, and integrate the correlation modelling module into it in the same fashion as in the main network (Figure 2). Specifically, at each iteration we perform pixel-wise fusion of the original color features and the geometric features transformed by the previously predicted pose, and feed the fused pixel-wise features to the pose refiner network, which outputs a residual pose relative to the previous prediction. After $K$ iterations, the final pose estimate is:

$\hat{p} = \Delta p_K \cdot \Delta p_{K-1} \cdots \Delta p_1 \cdot p_0,$

where $p_0$ is the initial prediction and $\Delta p_k$ is the residual pose predicted at iteration $k$.
Theoretically, the pose refiner network can be trained jointly with the main network, but for efficiency we start training the refiner only after the main network converges.
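The composition of residual poses can be sketched with 4×4 homogeneous transforms (toy pure-translation residuals, not network outputs):

```python
import numpy as np

def pose_matrix(R, t):
    """Assemble a 4x4 homogeneous transform from rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def refine(p0, residuals):
    """Compose residual poses onto the initial estimate:
    p_hat = dp_K @ ... @ dp_1 @ p0."""
    p = p0
    for dp in residuals:
        p = dp @ p
    return p

# Toy example: initial pose plus two small translation-only residuals
p0 = pose_matrix(np.eye(3), np.array([0.0, 0.0, 1.0]))
dps = [pose_matrix(np.eye(3), np.array([0.0, 0.0, 0.1])) for _ in range(2)]
p_hat = refine(p0, dps)  # translation becomes [0, 0, 1.2]
```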

Methods PoseCNN[28] DenseFusion[26] OURS (IntraMCM) OURS (InterMCM) OURS (Fuse_V1) OURS (Fuse_V2) OURS (Fuse_V3)
Metrics AUC <2cm AUC <2cm AUC <2cm AUC <2cm AUC <2cm AUC <2cm AUC <2cm
002_master_chef_can 68.06 51.09 73.16 72.56 87.61 88.37 86.94 88.07 86.18 86.28 92.47 98.71 87.24 86.18
003_cracker_box 83.38 73.27 94.21 98.50 94.80 99.54 93.69 99.08 91.79 98.50 95.45 98.62 95.20 99.19
004_sugar_box 97.15 99.49 96.50 100.00 93.73 100.00 95.06 100.00 95.68 100.00 96.69 99.92 96.19 99.58
005_tomato_soup_can 81.77 76.60 85.42 82.99 91.50 95.42 90.23 93.19 92.73 95.56 92.02 95.76 91.51 95.56
006_mustard_bottle 98.01 98.60 94.61 96.36 92.27 98.04 93.10 98.60 89.66 91.04 94.82 97.48 95.29 99.16
007_tuna_fish_can 83.87 72.13 81.88 62.28 80.86 69.69 86.18 84.58 85.94 83.45 88.85 84.15 85.27 86.31
008_pudding_box 96.62 100.00 93.33 98.60 91.69 97.13 91.83 98.60 91.76 99.07 93.16 98.60 94.10 98.13
009_gelatin_box 98.08 100.00 96.68 100.00 95.35 100.00 95.06 100.00 95.92 100.00 95.68 100.00 97.28 100.00
010_potted_meat_can 83.47 77.94 83.54 79.90 85.01 83.55 83.77 80.81 84.07 82.90 86.19 83.94 86.03 84.07
011_banana 91.86 88.13 83.49 88.13 84.70 81.79 90.71 98.68 88.73 98.15 92.57 98.94 86.84 88.92
019_pitcher_base 96.93 97.72 96.78 99.47 95.76 98.02 96.55 100.00 96.07 100.00 95.43 98.42 95.97 99.65
021_bleach_cleanser 92.54 92.71 89.93 90.96 87.93 83.19 89.10 83.28 90.19 89.70 88.99 86.20 89.00 83.28
024_bowl 80.97 54.93 89.50 94.83 88.70 97.78 87.00 84.24 86.32 90.64 86.06 94.33 89.08 95.81
025_mug 81.08 55.19 88.92 89.62 91.84 92.77 92.00 94.97 91.06 91.98 93.51 94.81 93.44 96.38
035_power_drill 97.66 99.24 92.55 96.40 92.05 95.65 86.60 90.35 85.05 87.70 82.89 84.77 93.52 98.20
036_wood_block 87.56 80.17 92.88 100.00 91.44 98.35 90.16 100.00 91.46 99.59 92.32 99.59 92.35 98.76
037_scissors 78.36 49.17 77.89 51.38 91.28 86.37 78.98 67.40 79.25 64.70 90.15 89.50 88.38 86.74
040_large_marker 85.26 87.19 92.95 100.00 93.55 100.00 93.84 100.00 94.10 100.00 93.91 99.85 93.82 99.85
051_large_clamp 75.19 74.86 72.48 78.65 71.27 78.51 72.14 77.95 70.18 75.70 70.31 76.69 73.22 78.65
052_extra_large_clamp 64.38 48.83 69.94 75.07 70.11 76.83 73.74 75.51 69.71 75.22 69.53 74.49 70.80 76.25
061_foam_brick 97.23 100.00 91.95 100.00 94.36 100.00 94.15 100.00 93.08 100.00 94.62 100.00 94.89 100.00
MEAN 86.64 79.87 87.55 88.37 88.85 91.48 88.61 91.21 88.04 90.96 89.79 93.08 89.97 92.89
TABLE I: 6D pose estimation accuracy on the YCB-Video dataset in terms of ADD(-S) <2cm and the AUC of ADD(-S). The objects with bold names are considered symmetric. All methods use RGB-D images as input. (Best viewed zoomed in.)

IV Experiments

IV-A Datasets and Metrics

We compare our method with state-of-the-art methods on two commonly used datasets, YCB-Video [28] and LineMOD [3]. Pose estimation performance is evaluated using (1) the average distance (ADD) metric [4] and (2) the average closest point distance (ADD-S) metric [28].

The ADD metric is obtained by first transforming the model points with the predicted pose $[\hat{R}|\hat{t}]$ and the ground-truth pose $[R|t]$, respectively, and then computing the mean of the pairwise distances between the two sets of transformed points:

$\mathrm{ADD} = \frac{1}{m} \sum_{x \in \mathcal{M}} \left\| (\hat{R}x + \hat{t}) - (Rx + t) \right\|,$

where $\mathcal{M}$ denotes the 3D model point set and $m$ is the number of points in the set.

The ADD-S metric [28] is proposed for symmetric objects, where the matching between point sets is ambiguous for some views. ADD-S is defined as:

$\mathrm{ADD\text{-}S} = \frac{1}{m} \sum_{x_1 \in \mathcal{M}} \min_{x_2 \in \mathcal{M}} \left\| (\hat{R}x_1 + \hat{t}) - (Rx_2 + t) \right\|.$
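Both metrics follow directly from the definitions; a NumPy sketch (plain broadcasting for the nearest-point search, which a KD-tree would accelerate for large models):

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform [R|t] to an (m, 3) point set."""
    return points @ R.T + t

def add_metric(points, R_pred, t_pred, R_gt, t_gt):
    """ADD: mean distance between corresponding transformed model points."""
    d = transform(points, R_pred, t_pred) - transform(points, R_gt, t_gt)
    return np.linalg.norm(d, axis=1).mean()

def adds_metric(points, R_pred, t_pred, R_gt, t_gt):
    """ADD-S: match each predicted point to the closest ground-truth point
    before averaging, to handle symmetric objects."""
    p = transform(points, R_pred, t_pred)                 # (m, 3)
    g = transform(points, R_gt, t_gt)                     # (m, 3)
    dists = np.linalg.norm(p[:, None] - g[None], axis=2)  # (m, m) pairwise
    return dists.min(axis=1).mean()

# Sanity check: identical poses give zero error under both metrics
pts = np.random.default_rng(3).standard_normal((50, 3))
R, t = np.eye(3), np.zeros(3)
```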
IV-B Implementation Details

We implement our method in the PyTorch [17] framework. All parameters, except where specified, are initialized with the PyTorch defaults. Our model is trained using the Adam optimizer [8] with an initial learning rate of 1e-4. After the loss of the estimator network falls below 0.016, a decay of 0.3 is applied and the refiner network is trained further. Mini-batch sizes are set separately for the estimator and refiner networks.

Objects SSD-6D [7] BB8 [21] DenseFusion [26] OURS (IntraMCM) OURS (InterMCM) OURS (Fuse_V1) OURS (Fuse_V2) OURS (Fuse_V3)
ape 65 40.4 92.3 94.9 95.2 94.8 95.6 95.4
benchvise 80 91.8 93.2 93.7 94.0 96.1 96.9 96.1
camera 78 55.7 94.4 97.5 95.6 96.0 97.9 97.5
can 86 64.1 93.1 95.4 95.7 92.2 96.0 95.0
cat 70 62.6 96.5 98.4 98.8 99.2 97.8 99.1
driller 73 74.4 87.0 92.2 92.7 91.4 95.6 94.7
duck 66 44.3 92.3 96.2 95.1 95.7 95.7 95.8
eggbox 100 57.8 99.8 100.0 99.6 100.0 99.9 99.9
glue 100 41.2 100.0 99.8 99.8 99.8 99.7 99.8
holepuncher 49 67.2 92.1 95.2 95.6 95.8 96.7 97.1
iron 78 84.7 97.0 95.8 96.2 97.4 97.8 98.4
lamp 73 76.5 95.3 95.4 96.3 96.5 97.0 96.8
phone 79 54.0 92.8 97.3 97.5 95.6 97.0 97.4
MEAN 77 62.7 94.3 96.3 96.3 96.2 97.2 97.1
TABLE II: The 6D pose estimation accuracy on the LINEMOD Dataset in terms of the ADD(-S) metric. The objects with bold name (glue and eggbox) are considered as symmetric. All the methods use RGB-D images as input.

IV-C Experiment Analysis

IV-C1 Ablation Study

In this section, we first perform an ablation study to verify the necessity of each component in our proposed framework, including the IntraMCM, InterMCM and correlation fusion modules, on both the YCB-Video (Table I) and LineMOD (Table II) datasets. From the tables, one can observe that using either IntraMCM or InterMCM alone already improves performance, as they capture discriminative intra- and inter-modality features.

Besides the two correlation modelling modules, we also explore different schemes for effectively fusing the information flow within and between the two modalities. According to the order of information passing, we design three fusion strategies: Fuse_V1, Fuse_V2 and Fuse_V3 (introduced in Section III-C). On both datasets, Fuse_V1 performs slightly worse than the Intra-only and Inter-only methods; we conjecture this is caused by over-fitting. Meanwhile, Fuse_V2 and Fuse_V3 outperform the Intra-only, Inter-only and parallel methods, indicating that sequential updating handles feature fusion better, while the specific update order has little influence on prediction performance.

Fig. 4: Visualizations of results on the YCB-Video dataset. The first row shows the original RGB images, the second row is from DenseFusion, and the third row is from our proposed method (Fuse_V2).

IV-C2 Comparison with State-of-the-Art Methods

We also compare our method with state-of-the-art methods that take RGB-D images as input and output 6D object poses, on the YCB-Video and LineMOD datasets.

Results on the YCB-Video dataset. The results in terms of the ADD(-S) AUC and ADD(-S) <2cm metrics are presented in Table I. On both metrics, our method is superior to the state-of-the-art methods [28, 26]. In particular, our method outperforms PoseCNN [28] by a margin of 13.21% and DenseFusion [26] by 4.71% in terms of the ADD(-S) <2cm metric.

Results on the LineMOD dataset. Table II summarizes the comparison with [7, 21, 26] in terms of the ADD(-S) metric on the LineMOD dataset. SSD-6D [7] and BB8 [21] obtain an initial 6D pose estimate with an RGB image as input and then use the depth image for pose refinement, while DenseFusion [26] uses RGB and depth images for both pose estimation and refinement. Compared with these methods, which ignore the correlation between the RGB and depth modalities, our proposed method achieves the best performance, as shown in Table II.

IV-D Efficiency and Qualitative Results

The running time of our full model is 49.8 ms per frame on average, including 23.6 ms for the semantic segmentation forward pass, 17.3 ms for the pose estimation forward pass, and 8.9 ms for the refiner forward pass, on a single Nvidia RTX 2080 Ti GPU. Thus, our method runs in real time on a GPU at around 20 fps.

In Figure 4, we present qualitative results on the YCB-Video dataset for both DenseFusion [26] and our proposed method. Our method is more accurate under heavy occlusion, as shown by the potted meat can in the first column (from left to right). Moreover, it generates more accurate predictions for symmetric objects, like the large clamp and foam brick in the second and third columns respectively. In the last column, both methods fail to predict the 6D pose of the bowl, a symmetric object under heavy occlusion.

IV-E Robotic Grasping Experiments

We carry out robotic grasping experiments in both simulation and real world to demonstrate that our algorithm is effective for robot grasping tasks. More visualization results are presented in the submitted video.

Grasping in simulation. We compare the proposed method with DenseFusion [26] in the Gazebo simulation environment. We retrain both models with data collected from the environment. We place four objects from the YCB-Video dataset at five random locations and in four random orientations on a table. The robot arm aligns its gripper with the predicted object pose to grasp the target object. The robot arm makes 20 attempts to grasp each object, 80 grasps in total per method. The results are shown in Table III. Thanks to the correlation fusion framework, our method has a significantly higher pick-up success rate than [26].

Success rate (%) tomato_soup_can mustard_bottle banana bleach_cleanser
DenseFusion [26] 80.0 70.0 55.0 65.0
Ours 90.0 85.0 75.0 80.0
TABLE III: Success rates for the grasping experiments with a robotic arm in the Gazebo simulation environment.

Grasping in the real world. We also apply our algorithm to a real-world robot task, where a robot arm picks up objects from a table. Without further fine-tuning on real test data, our model predicts object poses accurate enough for the grasping task. More visualization results are presented in the submitted video.

V Conclusion

In this paper, we proposed a novel Correlation Fusion framework with intra- and inter-modality correlation learning for 6D object pose estimation. The IntraMCM module helps to learn prominent modality-specific features, while the InterMCM module helps to capture complementary cross-modality features. Different fusion schemes are then explored to further improve 6D pose estimation performance. Extensive experiments on the YCB-Video and LineMOD datasets and a real robot grasping task demonstrate the superior performance of our method.

VI Acknowledgement

This research is supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project A18A2b0046).


  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086. Cited by: §II.
  • [2] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother (2014) Learning 6d object pose estimation using 3d object coordinates. In European conference on computer vision, pp. 536–551. Cited by: §I, §II.
  • [3] S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab, and V. Lepetit (2011) Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In 2011 international conference on computer vision, pp. 858–865. Cited by: §I, §IV-A.
  • [4] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab (2012) Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian conference on computer vision, pp. 548–562. Cited by: §IV-A.
  • [5] Y. Hu, J. Hugonot, P. Fua, and M. Salzmann (2018) Segmentation-driven 6d object pose estimation. arXiv preprint arXiv:1812.02541. Cited by: §I, §II.
  • [6] O. H. Jafari, S. K. Mustikovela, K. Pertsch, E. Brachmann, and C. Rother (2018) IPose: instance-aware 6d pose estimation of partly occluded objects. In Asian Conference on Computer Vision, pp. 477–492. Cited by: §I, §II.
  • [7] W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab (2017) SSD-6d: making rgb-based 3d detection and 6d pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1521–1529. Cited by: §I, §II, §IV-C2, TABLE II.
  • [8] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-B.
  • [9] A. Krull, E. Brachmann, F. Michel, M. Ying Yang, S. Gumhold, and C. Rother (2015) Learning analysis-by-synthesis for 6d pose estimation in rgb-d images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 954–962. Cited by: §I, §II.
  • [10] H. Li, Y. Liu, W. Ouyang, and X. Wang (2019) Zoom out-and-in network with map attention decision for region proposal and object detection. International Journal of Computer Vision 127 (3), pp. 225–238. Cited by: §II.
  • [11] Y. Li, G. Wang, X. Ji, Y. Xiang, and D. Fox (2018) Deepim: deep iterative matching for 6d pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 683–698. Cited by: §I.
  • [12] M. R. Loghmani, M. Planamente, B. Caputo, and M. Vincze (2019) Recurrent convolutional fusion for RGB-D object recognition. IEEE Robotics and Automation Letters 4 (3), pp. 2878–2885. Cited by: §I.
  • [13] F. Manhardt, W. Kehl, N. Navab, and F. Tombari (2018) Deep model-based 6d pose refinement in rgb. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 800–815. Cited by: §I.
  • [14] F. Michel, A. Kirillov, E. Brachmann, A. Krull, S. Gumhold, B. Savchynskyy, and C. Rother (2017) Global hypothesis generation for 6d object pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 462–471. Cited by: §I, §II.
  • [15] A. Mousavian, D. Anguelov, J. Flynn, and J. Kosecka (2017) 3D bounding box estimation using deep learning and geometry. In CVPR, pp. 5632–5640. Cited by: §I.
  • [16] M. Oberweger, M. Rad, and V. Lepetit (2018) Making deep heatmaps robust to partial occlusions for 3d object pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 119–134. Cited by: §I, §II.
  • [17] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §IV-B.
  • [18] G. Pavlakos, X. Zhou, A. Chan, K. G. Derpanis, and K. Daniilidis (2017) 6-dof object pose from semantic keypoints. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2011–2018. Cited by: §I, §II.
  • [19] S. Peng, Y. Liu, Q. Huang, H. Bao, and X. Zhou (2018) PVNet: pixel-wise voting network for 6dof pose estimation. arXiv preprint arXiv:1812.11788. Cited by: §I, §II.
  • [20] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660. Cited by: §III-A.
  • [21] M. Rad and V. Lepetit (2017) BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3828–3836. Cited by: §I, §II, §IV-C2, TABLE II.
  • [22] C. Rennie, R. Shome, K. E. Bekris, and A. F. De Souza (2016) A dataset for improved rgbd-based object detection and pose estimation for warehouse pick-and-place. IEEE Robotics and Automation Letters 1 (2), pp. 1179–1185. Cited by: §I.
  • [23] B. Tekin, S. N. Sinha, and P. Fua (2018) Real-time seamless single shot 6d object pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 292–301. Cited by: §I, §II.
  • [24] J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield (2018) Deep object pose estimation for semantic robotic grasping of household objects. In Conference on Robot Learning, pp. 306–316. Cited by: §I, §I, §II.
  • [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §II.
  • [26] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese (2019) DenseFusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §I, §II, §III-D, TABLE I, §IV-C2, §IV-C2, §IV-D, §IV-E, TABLE II, TABLE III.
  • [27] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017) Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. Cited by: §II.
  • [28] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox (2017) PoseCNN: a convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199. Cited by: §I, §I, §II, §II, §III-A, TABLE I, §IV-A, §IV-A, §IV-C2.
  • [29] A. Zeng, K. Yu, S. Song, D. Suo, E. Walker, A. Rodriguez, and J. Xiao (2017) Multi-view self-supervised deep learning for 6d pose estimation in the amazon picking challenge. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1386–1383. Cited by: §I.
  • [30] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena (2018) Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318. Cited by: §III.
  • [31] H. Zhu, J. Weibel, and S. Lu (2016) Discriminative multi-modal feature fusion for rgbd indoor scene recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2969–2976. Cited by: §I.