CullNet: Calibrated and Pose Aware Confidence Scores for Object Pose Estimation

by Kartik Gupta, et al.
Australian National University

We present a new approach for single-view, image-based object pose estimation. Specifically, this paper addresses the problem of culling false positives among several pose proposal estimates. Our approach targets the inaccurate confidence values predicted by CNNs, which many current methods use to choose a final object pose prediction. We present a network called CullNet to solve this task. CullNet takes pairs of pose masks rendered from a 3D model and cropped regions in the original image as input, and uses them to calibrate the confidence scores of the pose proposals. Our results show that this new set of confidence scores is significantly more reliable for accurate object pose estimation. Experimental results on multiple challenging datasets (LINEMOD and Occlusion LINEMOD) reflect the utility of our proposed method. Our overall pose estimation pipeline outperforms state-of-the-art object pose estimation methods on these standard object pose estimation datasets. Our code is publicly available on




1 Introduction

Object pose estimation is crucial for machines to interact with or manipulate objects in a meaningful way. It has applications in various areas such as augmented reality, virtual reality, autonomous driving, and robotics. The challenges to be dealt with are not trivial; background clutter, occlusions, textureless objects, and an often ill-posed formulation where small changes in rotation, translation, or scale can be confused with each other. This paper centers around the particular problem of recovering the six degrees of freedom pose of an object, i.e., rotation and translation in 3D with respect to the camera, dealing with the above-mentioned challenges.

Here, we address the problem of 6-DoF object pose estimation with respect to the camera using an RGB image, and corresponding 3D mesh models of object classes of interest. Specifically, each test image consists of a cluttered environment with a single instance of a textureless object class for which the pose with respect to the camera needs to be estimated. We address this problem on datasets particularly having just a couple of hundred training images with given object pose annotations with respect to the camera. To augment the training data, available 3D mesh models are thus rendered with several different pose variations.

Figure 1: Comparisons of pose proposal confidence output of the Keypoint proposal network and CullNet. (a) Comparison of confidence scores for the ‘Duck’ class in the LINEMOD dataset. (b) Comparison of confidence scores for all classes in the LINEMOD dataset.

Overall, this work presents a new approach to predicting several object pose proposals in terms of 2D keypoints, followed by a method to score these proposals. To accomplish this, a fixed number of 3D keypoints are first selected from the object mesh model vertices in the object-centric coordinate system. Given the ground truth pose of the object in each training image, a CNN based on the YOLOv3 [20] architecture is trained to predict the 2D projections of these keypoints. Among the several sets of keypoints predicted by this CNN, we select the top-$k$ most confident sets based on the confidence score produced by YOLOv3 and compute the pose for each set of 2D-3D keypoint correspondences using the E-PnP [10] algorithm. The object mesh models are then rendered with the predicted pose estimates to estimate the segmentation masks of the object class of interest. Each segmentation mask is tightly cropped along with the input image to form a 4-channel input for our final CNN, CullNet, which predicts the calibrated confidence scores used for selecting the most accurate pose proposal. The two CNNs, in concert, address the object pose estimation problem more accurately because object pose estimation is highly dependent on accurate keypoint proposals. Thus, in this work, many sets of object keypoint proposals are predicted, amongst which an accurate candidate is likely to exist. Via our scoring mechanism, the most accurate keypoint proposal can then be selected as the final prediction.

Recent methods [8, 23, 16] also use deep learning to predict several pose hypotheses. However, these methods rely on the same backbone network to produce both the pose hypotheses and the confidence measure. Selecting the final pose prediction from this set of hypotheses using the object confidence measures predicted by the same network is undesirable, because the object confidences predicted by the keypoint proposal network do not contain any estimate of how accurate the respective proposed pose is. Thus, we present a new way of re-estimating the object pose confidence measures with an approach that also takes into account knowledge of the proposed object poses. We refer to these new scores as calibrated pose proposal confidences. The advantage of using calibrated confidence scores is clearly illustrated in Figure 1. Fig. 1(a) and Fig. 1(b) compare the distribution of the backbone (keypoint proposal) network or CullNet confidence scores vs. ground truth pose proposal confidence scores for the top-$k$ most confident proposals of all test images of the ‘Duck’ class and of all classes together in the LINEMOD dataset, respectively. (To simplify the plot, we randomly sample 3000 confidence outputs from the set of top-$k$ most confident proposals of all test images.) Confidence scores produced by CullNet are more correlated with ground truth confidence scores than scores produced by the keypoint proposal network.

Similar to recent keypoint based methods [23, 18, 16], our approach first predicts the 2D keypoints in the RGB images in an end-to-end learning framework. To accomplish this, we employ the backbone architecture of YOLOv3 [20] for the prediction of a set of 2D keypoints. YOLOv3 is one of the fastest object detectors, producing many object proposals for object detection. We amended the original YOLOv3 to predict 2D keypoints rather than object bounding boxes. Then, our proposed method, called CullNet, crops image patches around the top-$k$ most confident keypoint proposals predicted by the backbone network, along with crops of the masks rendered from their corresponding proposed poses. These are used to predict the calibrated confidence measure, which can be used either for non-maximum suppression in multi-object pose estimation, or for arg-max selection in single object pose estimation.

Our main contributions are three-fold. i) a new method to calibrate the pose proposal confidences using the knowledge of the corresponding predicted pose, called CullNet, ii) a new keypoint proposal method based on YOLOv3 [20], which follows a feature pyramid network to predict many sets of keypoint proposals at multiple scales, and iii) an extensive set of evaluations, producing a new state-of-the-art, on the standard benchmark pose estimation datasets: LINEMOD and Occlusion LINEMOD.

2 Related Work

Object pose estimation was popularly addressed using keypoint-based methods for a long time [14, 24, 22]. However, these methods lack the ability to handle textureless objects as their feature representations require texture information. Recent deep learning based methods try to solve this using CNNs. The solution of the problem requires CNNs to output pose in terms of 3D rotation and 3D translation which has been achieved in different ways.

Direct Pose Prediction

One way to deal with this is to let the network directly predict the 3D rotation and 3D translation. However, balancing the rotation and translation loss is not trivial, as discussed in [9], where they attempt to directly predict rotation and translation vectors for the task of camera re-localization. PoseCNN [26] directly outputs rotation and translation vectors for object pose estimation by predicting them separately in a multi-stage network. Unlike PoseCNN, which predicts the rotation quaternions, SSD-6D [8] converts the pose estimation problem into a classification problem by discretizing the views instead of directly predicting the pose. The above-mentioned methods let the network predict the pose from color images directly, which can be difficult for CNNs to achieve, as the CNNs are required to learn all the geometrical knowledge from training data alone.

Keypoint based methods

Another way to formulate the output of CNNs for object pose estimation is to detect keypoints and then use the Perspective-n-Point (PnP) algorithm [10] to estimate the final pose. The works of [16, 18, 23] achieve significant improvements in pose accuracy on challenging datasets, in particular on textureless objects. A key problem in the above methods is inaccurately predicted 2D keypoints. PnP-based pose estimation techniques tend to produce highly perturbed pose estimates even with small amounts of noise in the predicted 2D keypoints. BB8 [18] encounters this problem when predicting a single pose proposal using a CNN on cropped object segments. Due to the noisy regression outputs of CNNs, a single pose proposal is often not accurate. Also, BB8 is not able to perform object pose estimation in real-time. To this end, Tekin [23] uses the YOLOv2 object detection network to predict keypoint proposals, but the method lacks an effective way to cull false positives. It uses neighborhood weighted averaging of the keypoint proposals centered around the most confident keypoint proposal. The recently proposed PV-Net [17] tries to address the problem of partial occlusion in RGB based object pose estimation by regressing dense pixel-wise unit vectors pointing to the keypoints, which are combined using a RANSAC-like voting scheme.

Pose refinement methods

Recent deep learning solutions have also considered techniques for pose refinement from RGB images [11, 15] as a way to bridge the gap between RGB and RGBD pose accuracies. DeepIM [11] uses a FlowNet backbone architecture to predict a relative SE(3) transformation that matches the colored image of an object, rendered using the initial pose, to the observed image. Manhardt et al. [15] introduce a visual loss that optimizes the predicted refinement of translation and rotation by aligning the contours of the object, rendered with an initial rotation and translation, with the scene image. The problem specifically targeted in this paper is culling false positives from several object pose proposals, and such a refinement mechanism can still be used at the end of our pipeline. To the best of our knowledge, there is no work directly addressing the problem of unreliable object pose confidences produced by CNNs.

Inaccurate object confidences also cause performance degradation in multi-object pose estimation where multiple object pose proposals are predicted for each object. Most state-of-the-art object detection methods [21, 19, 13], are dependent on non-maximum suppression (NMS) to cull overlapping, less confident, object proposals. NMS relies on the confidence measure produced by a CNN for a proposal, which is, again, noisy. Our proposed approach addresses the above-mentioned problems associated with the object confidence output of CNNs by calibrating the confidence measures using knowledge of each pose hypothesis. These calibrated confidence predictions can then be used both in single and multi-object pose estimation.

3 Approach

Figure 2: Overview of our pose estimation pipeline. Our approach operates in two stages: a) a YOLOv3 based architecture outputs three 3D tensors at three different scales, with outputs arranged along the spatial grid of each tensor; b) using sets of 2D keypoint proposals, pose proposals are estimated using the E-PnP algorithm, then the original image and the pose rendered mask are cropped to tightly fit the rendered mask. Cropped RGB patches concatenated with the corresponding pose rendered masks are passed to CullNet to output calibrated object confidences. Calibrated confidences are finally used to pick the most confident pose estimate.

In the discussion above, we identify the outstanding issue of overconfident false positives (or inaccurate pose proposals) in current state-of-the-art object pose estimation methods. We address these issues with our proposed object pose estimation pipeline illustrated in Figure 2. Our keypoint proposal method is inspired by YOLOv3 [20] which can produce many sets of object keypoint proposals in real-time. Our proposed network, CullNet, produces calibrated object confidences using knowledge of the proposed pose in relation to the observation in the original image. These calibrated confidences can then reliably be used to select a final estimate of the object pose from several object pose proposals.

Using a feature pyramid network [12], the backbone architecture outputs several pose proposals in the form of 2D keypoints. The network is based on the Darknet-53 architecture [20]. One of the crucial advantages of the YOLO network architecture is the gain in speed for object pose prediction, as it is one of the state-of-the-art real-time object detection approaches. The network takes an input image of resolution and produces outputs at three scales in the form of 3D tensors with spatial sizes , and cell grids where each grid point corresponds to a dimensional vector which includes coordinates of 2D keypoints, proposal confidence and class scores. In the case of YOLO object detection, the confidence loss is learned based on the ground truth IoU overlap between the predicted boxes and the ground truth boxes. Such a formulation of IoU is not easily established in the case of 2D projections of correspondences. We use the confidence function proposed in [23] to assign probabilities to the distances between each 2D keypoint in the pose proposals and the ground truth 2D keypoints, based on a threshold. It is defined as follows:

$$c(x) = \begin{cases} e^{\alpha \left(1 - \frac{D(x)}{d_{th}}\right)}, & \text{if } D(x) < d_{th} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

The distance $D(x)$ is the Euclidean distance between the predicted 2D keypoint, represented by $x$, and the respective ground truth 2D keypoint. The confidence function is set to $0$ for predictions with a distance value greater than or equal to the threshold $d_{th}$. The sharpness of the exponential function is defined by the parameter $\alpha$. In place of the IoU based confidence measure of YOLOv3, the final confidence for each proposal is thus calculated as the mean value of $c(x)$ over the 2D keypoint predictions.
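As a minimal sketch of the confidence function described above (exponential fall-off with a sharpness parameter, zero at or beyond the distance threshold, averaged over the keypoints of a proposal), the following Python may help. The names `keypoint_confidence` and `proposal_confidence`, the `normalize` flag, and the values used in the usage note are assumptions made for illustration; in particular, rescaling the exponential into [0, 1] is an assumption of this sketch and is not stated in the text.

```python
import math


def keypoint_confidence(distance, d_th, alpha, normalize=True):
    """Confidence of one 2D keypoint from its pixel distance to ground truth.

    Exponential fall-off with sharpness `alpha`; zero at or beyond `d_th`.
    `normalize=True` rescales the value into [0, 1]. This rescaling is an
    assumption of the sketch (it makes the target reachable by a sigmoid).
    Assumes alpha > 0.
    """
    if distance >= d_th:
        return 0.0
    raw = math.exp(alpha * (1.0 - distance / d_th))
    if not normalize:
        return raw
    return (raw - 1.0) / (math.exp(alpha) - 1.0)


def proposal_confidence(pred_kps, gt_kps, d_th, alpha):
    """Mean keypoint confidence over one set of predicted 2D keypoints."""
    scores = [
        keypoint_confidence(math.dist(p, g), d_th, alpha)
        for p, g in zip(pred_kps, gt_kps)
    ]
    return sum(scores) / len(scores)
```

For example, `proposal_confidence(kps, kps, d_th=30.0, alpha=2.0)` evaluates to 1.0 for a perfect prediction, and any keypoint that is `d_th` pixels or more away contributes 0. The values 30.0 and 2.0 are placeholders, not the settings used in the experiments.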

The backbone network described above predicts object pose proposals in terms of coordinates of 2D projected correspondences of object keypoints. In the case of single or multiple instances of an object in the scene, choosing one or many of them is not trivial. In object pose prediction, a culling process with inaccurate object confidence scores often results in culling a better candidate for pose prediction because of being predicted with lower confidence. We thus propose a new confidence calibrating network called CullNet, to predict better confidence measures based on the pose information of each pose proposal. In between the backbone network and CullNet, there is an intermediate processing step to associate pose information with each keypoint proposal as explained below.

First, we take the top-$k$ most confident 2D keypoint proposals output by the backbone network and estimate the pose for these proposals using the PnP algorithm. For each of the pose proposals, we render a binary object segmentation mask. We emphasize that this mask rendering does not require extensive computation, as it does not involve a colored mask: the segmentation masks can simply be calculated by finding the 2D projections of all the mesh vertices of an object. Each rendered mask is cropped tightly around the segmentation boundaries. With the same cropping coordinates, the corresponding RGB patch is cropped from the input image. Then, the cropped segmentation mask is concatenated as the fourth channel to the corresponding RGB patch. For each of the top-$k$ most confident proposals, CullNet takes the concatenated RGB patch and mask to predict how accurately each pose proposal aligns with the cropped RGB patch. We formulate the ground truth confidence measure for the final output of CullNet using Eq. 1, where the Euclidean distance is

$$D(x) = \left\lVert \pi\left(K P_{gt}\, x\right) - \pi\left(K P_{pred}\, x\right) \right\rVert_2 \tag{2}$$

Here $x$ denotes a 3D vertex from the object’s mesh model, $P_{gt}$ is the ground truth pose, $K$ is the intrinsic camera parameters, $P_{pred}$ is the predicted pose amongst the top-$k$ most confident pose proposals from the keypoint proposal network, and $\pi(\cdot)$ denotes the perspective division that maps homogeneous image coordinates to pixel coordinates. It is important to note that the final ground truth value for the calibrated confidence, i.e. $c_{gt}$, is the mean over the 2D projections of all mesh vertices $V$ of an object:

$$c_{gt} = \frac{1}{|V|} \sum_{x \in V} c(x) \tag{3}$$
CullNet is based on the ResNet-50 architecture, with group norm [25] replacing batch norm. It takes a 4-channel input consisting of the cropped RGB patch and the rendered mask. Group normalization helps the network converge faster with larger batch sizes in which many patches come from the same image; such batches have a distribution that degrades batch norm’s statistic estimation [6].
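The ground truth target for CullNet described above (project every mesh vertex with the ground truth pose and with the proposed pose, score the pixel distance, and average over vertices) can be sketched in plain Python as follows. The function names, the `(R, t)` pose representation, and the rescaling of the exponential into [0, 1] are assumptions of this sketch rather than details taken from the paper.

```python
import math


def project(K, R, t, x):
    """Project a 3D object-frame point to pixels with intrinsics K and pose (R, t)."""
    cam = [sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3)]
    u, v, w = (sum(K[i][j] * cam[j] for j in range(3)) for i in range(3))
    return (u / w, v / w)


def vertex_confidence(distance, d_th, alpha):
    """Exponential confidence, zero at or beyond d_th (rescaled to [0, 1] by assumption)."""
    if distance >= d_th:
        return 0.0
    return (math.exp(alpha * (1.0 - distance / d_th)) - 1.0) / (math.exp(alpha) - 1.0)


def calibrated_gt_confidence(vertices, K, pose_gt, pose_pred, d_th, alpha):
    """Mean confidence over the 2D projections of all mesh vertices.

    pose_gt and pose_pred are (R, t) tuples: a 3x3 rotation and a 3-vector.
    """
    total = 0.0
    for x in vertices:
        d = math.dist(project(K, *pose_gt, x), project(K, *pose_pred, x))
        total += vertex_confidence(d, d_th, alpha)
    return total / len(vertices)
```

A proposal whose pose equals the ground truth scores 1.0, and a proposal whose projected vertices all land at least `d_th` pixels away scores 0.0, so the target directly measures how well the rendered pose aligns with the image.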

3.1 Training

Our complete approach is trained in two stages. First we train the backbone network and then train CullNet using the proposals generated by the backbone network.

Keypoint proposal network

In the first stage, the backbone network needs to learn to predict 2D keypoints, confidence scores and class probabilities. The predictions for 2D keypoints are done in the down-scaled size of image coordinates to , and respectively. The 2D keypoint predictions are expressed as an offset from the top-left corner of the grid cells. The ground truth confidence scores for the set of 2D keypoints based pose proposal corresponding to each grid cell are calculated using Eq. 1, where the mean confidence of each set of proposals is calculated as the average over each keypoint confidence. We use a sigmoid function to restrict the predicted confidence score to the range $[0, 1]$. We minimize the following loss function to train our backbone network.


Here, the terms , and denote the keypoint, confidence and the classification loss, respectively. We use mean-squared error for the coordinate and confidence losses, and cross entropy for the classification loss. The respective loss functions are formulated as follows for each of the three 3D tensor outputs of the keypoint proposal network:


where denotes if the object’s centroid keypoint appears in cell , where it is else it is and the normalizing constants and . and represent predicted and ground truth confidence scores of the keypoint proposal network. , and , denote coordinates for predicted and ground truth keypoints for each set of proposals amongst keypoint proposals, where varies from , and in three different scales. Here, and represent predicted and ground truth class probability vectors.

Culling Mechanism

In the final stage of training, CullNet needs to learn to predict a calibrated, pose-aware confidence measure. We use the sigmoid function to predict outputs of CullNet in the range $[0, 1]$. The ground truth calibrated confidences at this stage are calculated based on Eq. 1, as an average of the confidence of all 2D projections of mesh vertices at each predicted pose proposal, using Eq. 3. For each image, the backbone network passes the top-$k$ most confident object keypoint proposals to CullNet. Then, pose hypotheses are estimated for each keypoint proposal using the E-PnP algorithm [10]. CullNet then uses the concatenated cropped RGB image patches and mask renderings (rescaled) as input for each proposal to produce a confidence measure of how accurate the proposed pose is. We use mean-squared error for the calibrated confidence loss.
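The construction of the CullNet input (tight crop around the rendered mask, the same crop applied to the RGB image, mask appended as a fourth channel) can be illustrated with a small pure-Python sketch. The helper names `tight_bbox` and `crop_rgbm` and the nested-list image representation are hypothetical; a real implementation would operate on arrays and would also rescale the patch to the network's fixed input size, which is not shown here.

```python
def tight_bbox(mask):
    """Return (row0, row1, col0, col1) tightly enclosing the non-zero mask pixels.

    The upper bounds are exclusive. Returns None for an empty mask.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def crop_rgbm(image, mask):
    """Crop image and mask with the mask's tight box and stack them.

    image: H x W nested list of (r, g, b) tuples.
    mask:  H x W nested list of 0/1 values rendered from a pose proposal.
    Returns an h x w nested list of [r, g, b, m] pixels (4 channels).
    """
    box = tight_bbox(mask)
    if box is None:
        return []
    r0, r1, c0, c1 = box
    return [
        [list(image[r][c]) + [mask[r][c]] for c in range(c0, c1)]
        for r in range(r0, r1)
    ]
```

Because the crop is driven by the rendered mask rather than by the image, each pose proposal yields its own patch, and a misaligned proposal shows up as a mask channel that disagrees with the object visible in the RGB channels.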

3.2 Inference

For inference, we first output the top-$k$ most confident keypoint proposals of each object. Then, for each keypoint proposal, the object pose is estimated using the E-PnP algorithm. Based on the predicted pose of each of the top-$k$ most confident keypoint proposals, the tightly cropped object region of the pose rendered mask is concatenated with the corresponding image patch and input to CullNet to predict calibrated confidences. Finally, taking the arg-max over the calibrated confidences output by CullNet, we find the estimated pose for the object.
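The inference procedure above reduces to a short selection loop, sketched below. The callables `estimate_pose` (standing in for E-PnP) and `score_pose` (standing in for rendering, cropping and a CullNet forward pass) are hypothetical placeholders supplied by the caller, as is the dictionary layout of a proposal.

```python
def select_pose(keypoint_proposals, k, estimate_pose, score_pose):
    """Pick the final pose by arg-max over calibrated confidences.

    keypoint_proposals: list of dicts with "keypoints" and the backbone's
        own "confidence" (hypothetical layout for this sketch).
    estimate_pose: callable mapping 2D keypoints to a pose (e.g. a PnP solver).
    score_pose: callable mapping a pose to a calibrated confidence.
    """
    # Keep the k proposals the backbone is most confident about.
    top_k = sorted(
        keypoint_proposals, key=lambda p: p["confidence"], reverse=True
    )[:k]
    poses = [estimate_pose(p["keypoints"]) for p in top_k]
    scores = [score_pose(pose) for pose in poses]
    # Re-rank with the calibrated scores instead of the backbone scores.
    best = max(range(len(poses)), key=scores.__getitem__)
    return poses[best], scores[best]
```

The backbone confidence is used only to shortlist proposals; the final choice depends solely on the calibrated score, so a proposal ranked second or lower by the backbone can still be selected.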

4 Experiments

2D Reprojection-5px

| Method | Ape | Bvise | Cam | Can | Cat | Driller | Duck | Box | Glue | Holep | Iron | Lamp | Phone | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ours w/ BC | 97.7 | 99.0 | 97.9 | 98.9 | 98.7 | 96.4 | 97.0 | 98.7 | 98.2 | 99.0 | 97.2 | 95.4 | 95.6 | 97.7 |
| Ours w/o BC | 97.6 | 99.0 | 98.6 | 98.9 | 98.6 | 96.5 | 96.8 | 98.7 | 98.3 | 99.0 | 96.0 | 94.7 | 95.1 | 97.5 |
| Tekin [23] | 92.1 | 95.1 | 93.2 | 97.4 | 97.4 | 79.4 | 94.7 | 90.3 | 96.5 | 92.9 | 82.9 | 76.9 | 86.1 | 90.4 |
| BB8 [18] | 95.3 | 80.0 | 80.9 | 84.1 | 97.0 | 74.1 | 81.2 | 87.9 | 89.0 | 90.5 | 78.9 | 74.4 | 77.6 | 83.9 |
| DeepIM (*) [11] | 98.4 | 97.0 | 98.9 | 99.7 | 98.7 | 96.1 | 98.5 | 96.2 | 98.9 | 96.3 | 97.2 | 94.2 | 97.7 | 97.5 |
| BB8 [18] (*) | 96.6 | 90.1 | 86.0 | 91.2 | 98.8 | 80.9 | 92.2 | 91.0 | 92.3 | 95.3 | 84.8 | 75.8 | 85.3 | 89.3 |
| Brachmann [2] (*) | 85.2 | 67.9 | 58.7 | 70.8 | 84.2 | 73.9 | 73.1 | 83.1 | 74.2 | 78.9 | 83.6 | 64.0 | 60.6 | 73.7 |

AD{DI}-10%

| Method | Ape | Bvise | Cam | Can | Cat | Driller | Duck | Box | Glue | Holep | Iron | Lamp | Phone | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ours w/ BC | 55.1 | 89.0 | 66.2 | 89.2 | 75.3 | 88.6 | 41.8 | 97.1 | 94.6 | 68.9 | 90.9 | 94.2 | 67.6 | 78.3 |
| Ours w/o BC | 34.5 | 79.2 | 71.5 | 85.8 | 71.1 | 89.3 | 39.3 | 86.1 | 87.6 | 70.4 | 85.8 | 73.9 | 63.8 | 72.2 |
| Do [3] (see note) | 38.8 | 71.2 | 52.5 | 86.1 | 66.2 | 82.3 | 32.5 | 79.4 | 63.7 | 56.4 | 65.1 | 89.4 | 65.0 | 65.2 |
| Tekin [23] | 21.6 | 81.8 | 36.6 | 68.8 | 41.8 | 63.5 | 27.2 | 69.6 | 80.0 | 42.6 | 74.9 | 71.1 | 47.7 | 55.9 |
| BB8 [18] | 27.9 | 62.0 | 40.1 | 48.1 | 45.2 | 58.6 | 32.8 | 40.0 | 27.0 | 42.4 | 67.0 | 39.9 | 35.2 | 43.6 |
| SSD-6D [8] | 0 | 0.2 | 0.4 | 1.4 | 0.5 | 2.6 | 0 | 8.9 | 0 | 0.3 | 8.9 | 8.2 | 0.2 | 2.42 |
| DeepIM (*) [11] | 77.0 | 97.5 | 93.5 | 96.5 | 82.1 | 95.0 | 77.7 | 97.1 | 99.4 | 52.8 | 98.3 | 97.5 | 87.7 | 88.6 |
| Manhardt [15] (*) | - | - | - | - | - | - | - | - | - | - | - | - | - | 34.1 |
| BB8 [18] (*) | 40.4 | 91.8 | 55.7 | 64.1 | 62.6 | 74.4 | 44.3 | 57.8 | 41.2 | 67.2 | 84.7 | 76.5 | 54.0 | 62.7 |
| SSD-6D [8] (*) | - | - | - | - | - | - | - | - | - | - | - | - | - | 76.3 |
| Brachmann [2] (*) | 33.2 | 64.8 | 38.4 | 62.9 | 42.7 | 61.9 | 30.2 | 49.9 | 31.2 | 52.8 | 80.0 | 67.0 | 38.1 | 50.2 |

Table 1: The comparison of accuracies of our method and the baseline methods on the LINEMOD dataset using standard pose evaluation metrics. (*) denotes pose refinement methods. BC refers to bias correction using error modes from train data. Note: for Do [3], results on the 2D Reprojection metric are not available.

We evaluate our approach on the task of single object pose estimation and show comparisons with the state-of-the-art RGB based object pose estimation approaches.

4.1 Implementation Details

We use Darknet-53 pre-trained on the ImageNet classification task as our backbone network. In the keypoint proposal training, we train only for classification and regression loss for the first epochs and all losses for the next epochs. CullNet is trained for epochs. The sharpness of the confidence function is set to and the distance threshold to pixels. We found to be best at keeping the speed-accuracy trade-off in mind. The backbone network has been trained with a batch size of 16 and CullNet with a batch size of 128. We start with a learning rate of for the backbone network using the SGD optimizer and divide the learning rate by a factor of 10 after 50 and 75 epochs respectively. We use a learning rate of for the culling network using the SGD optimizer and divide the learning rate by a factor of 10 after 10 epochs. The number of group norm channels in CullNet is found to be best at . To avoid overfitting, we use extensive data augmentation for training CullNet by randomly changing the hue, saturation, and exposure of the image by up to a factor of . We also randomly scale and translate the image by up to a factor of of the image size. During the training of CullNet, we double the number of pose proposals for each image by randomly perturbing the estimated pose from the keypoint proposal network to avoid overfitting. We choose the corners and the centroid of the cuboid bounding the object as the 9 keypoints in our experiments (similar to Tekin [23]).
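The choice of 9 keypoints (the 8 corners and the centroid of the cuboid bounding the object) can be computed directly from the mesh vertices. The sketch below assumes an axis-aligned bounding cuboid in the object-centric coordinate system and places the centroid first; both the function name `cuboid_keypoints` and the keypoint ordering are assumptions made for illustration.

```python
from itertools import product


def cuboid_keypoints(vertices):
    """Return 9 3D keypoints: the bounding-cuboid centroid, then its 8 corners.

    vertices: iterable of (x, y, z) mesh vertices in the object frame.
    The cuboid is axis-aligned in that frame (an assumption of this sketch).
    """
    vertices = list(vertices)
    mins = [min(v[i] for v in vertices) for i in range(3)]
    maxs = [max(v[i] for v in vertices) for i in range(3)]
    centroid = tuple((lo + hi) / 2.0 for lo, hi in zip(mins, maxs))
    # Every combination of (min or max) along x, y and z gives one corner.
    corners = [tuple(c) for c in product(*zip(mins, maxs))]
    return [centroid] + corners
```

These 3D points are fixed per object class, so the network only has to predict their 2D projections, and the 2D-3D correspondences needed by PnP follow from the keypoint order.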

4.2 Evaluation Metrics

We use two standard metrics to evaluate the 6D pose accuracy, namely 2D reprojection error, and the AD{DI} metric as used in [2, 8, 18].

2D Reprojection measures the mean distance between the 2D projections of the object’s mesh vertices using the ground truth pose and the estimated pose, for each object pose instance. A pose instance is considered correct if the mean distance is less than 5 pixels.

In contrast, the AD{DI} metric measures the mean distance between the transformed coordinates of mesh vertices using the ground truth pose and the estimated pose for each object pose instance. A pose instance is considered correct if the mean distance is less than 10% of the object mesh model’s diameter. To handle rotationally symmetric objects, the mean distance is calculated based on the closest point distance as done in [18].
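Both metrics can be written down compactly. The following pure-Python sketch assumes poses are given as `(R, t)` tuples and uses the thresholds stated above (5 pixels, 10% of the diameter); the function names are hypothetical, and the closest-point variant uses a brute-force nearest-neighbour search that is only practical for small vertex sets.

```python
import math


def transform(R, t, x):
    """Apply the rigid transform (R, t) to a 3D point x."""
    return tuple(sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3))


def project(K, R, t, x):
    """Project a 3D object-frame point to pixel coordinates."""
    cam = transform(R, t, x)
    u, v, w = (sum(K[i][j] * cam[j] for j in range(3)) for i in range(3))
    return (u / w, v / w)


def reprojection_error(vertices, K, pose_gt, pose_est):
    """Mean pixel distance between projections under the two poses."""
    dists = [
        math.dist(project(K, *pose_gt, x), project(K, *pose_est, x))
        for x in vertices
    ]
    return sum(dists) / len(dists)


def add_error(vertices, pose_gt, pose_est, symmetric=False):
    """Mean 3D distance between transformed vertices.

    symmetric=True uses the closest-point distance for symmetric objects.
    """
    gt = [transform(*pose_gt, x) for x in vertices]
    est = [transform(*pose_est, x) for x in vertices]
    if symmetric:
        return sum(min(math.dist(g, e) for e in est) for g in gt) / len(gt)
    return sum(math.dist(g, e) for g, e in zip(gt, est)) / len(gt)


def reprojection_correct(error_px, threshold_px=5.0):
    return error_px < threshold_px


def add_correct(error, diameter, fraction=0.10):
    return error < fraction * diameter
```

The two metrics weight errors differently: a translation error along the camera's optical axis barely moves the 2D projections but contributes fully to the 3D distance, which is why a method can score well on 2D reprojection and noticeably lower on AD{DI}.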

Figure 3: Percentage of correctly estimated poses at different thresholds of reprojection error (in pixels) for different objects of the LINEMOD dataset [5].

4.3 LINEMOD Dataset

The LINEMOD dataset [5] is a standard benchmark dataset for 6D pose estimation. This dataset is comprised of 13 object classes involving many challenges such as background clutter and textureless objects. Each RGB image has been annotated with only the central object in the scene. We use the same data split for each class as Brachmann [2] used, with around 200 images for each object in the training set and 1,000 images in the test set. To prevent overfitting, for training we generated synthetic images by rendering objects with uniformly sampled viewpoints with backgrounds randomly selected from the SUN397 dataset [27]. To keep the distributions of real and synthetic images the same and also to avoid learning any information from the checkerboard background, we augment the real training images by using the segmentation mask from real images and changing the background from images randomly sampled from the PASCAL-VOC dataset [4].

We show comparisons with competing RGB based object pose estimation methods in Table 1. Our approach outperforms all existing methods comfortably on the 2D-Reprojection metric. It also performs slightly better than the state-of-the-art pose refinement methods on this metric. We emphasize that our method, which uses a two-stage pipeline, does not use any pose refinement. Pose refinement methods most often require multiple iterations of refinement along with complete colored renderings of mesh models. Our approach requires only a segmentation mask rendering for each of the top-$k$ most confident pose estimates to calibrate the confidence scores within a single pass through CullNet.

Our proposed approach also performs better than all existing comparable methods when evaluated on the AD{DI} metric. However, the DeepIM [11] pose refinement method outperforms our approach on this metric, whereas ours performs better on the 2D-Reprojection metric. We investigated this issue and found that the LINEMOD dataset has many instances of noisy pose annotations, caused by registration errors between the RGB and the depth image, because the pose annotation process was done using ICP on the depth images. A similar observation was also made by Manhardt et al. [15] when evaluating their deep pose refinement method. To partially address this issue, we calculate the error statistics on the LINEMOD training data using the ADD metric from the pose estimated by our final trained network pipeline. We make histogram plots (using 400 bins) of the ADD error along the z-axis after transforming the coordinates of mesh vertices using the estimated pose for each object pose instance. Then, we use the mode of the training errors along the z-axis for each class as an offset to correct the bias. The offset is added to the z-axis translation of all the pose instances predicted by our method, to partially solve the bias problem arising from noisy annotations.
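The bias correction described above can be sketched as a histogram-mode estimate followed by a constant shift. In the sketch below, the z-axis error of a training instance is assumed to be signed as ground truth minus estimate, so that adding the mode to a predicted translation moves it toward the annotations; this sign convention, the function names, and the use of the bin centre as the mode are all assumptions of the sketch.

```python
def histogram_mode(values, bins=400):
    """Return the centre of the most populated bin of an equal-width histogram."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    best = max(range(bins), key=counts.__getitem__)
    return lo + (best + 0.5) * width


def correct_z_bias(translations, train_z_errors, bins=400):
    """Add the per-class mode of training z-errors to each predicted translation.

    translations: list of (tx, ty, tz) predicted for one object class.
    train_z_errors: signed z-axis errors on training data for that class,
        assumed here to be (ground truth z) - (estimated z).
    """
    offset = histogram_mode(train_z_errors, bins)
    return [(tx, ty, tz + offset) for tx, ty, tz in translations]
```

Using the mode rather than the mean keeps the offset tied to the dominant, systematic error and makes it insensitive to a few grossly wrong training predictions.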

| Culling Methods | 2D-Reprojection |
|---|---|
| YOLOv2 + argmax | 91.6 |
| YOLOv3 + argmax | 95.7 |
| YOLOv2 + Top-6 CullNet | 93.4 |
| YOLOv3 + Top-6 CullNet | 97.7 |

(a) Accuracy comparisons on the LINEMOD dataset using different culling methods on multiple keypoint proposal networks. (b) Robustness of RANSAC vs. CullNet with a varying number of top-$k$ most confident pose proposals, using YOLOv3 as the keypoint proposal network.

Table 2: Ablation studies to show the effectiveness of CullNet on the LINEMOD dataset.
2D Reprojection-5px

| Object | BB8 [18] | Tekin [23] | PoseCNN [26] | Jafari [7] | Ours (with BC) |
|---|---|---|---|---|---|
| ape | 28.5 | 7.01 | 34.6 | 24.2 | 55.98 |
| can | 1.20 | 11.20 | 15.1 | 30.2 | 39.11 |
| cat | 9.60 | 3.62 | 10.4 | 12.3 | 34.2 |
| driller | 0.0 | 1.40 | 7.4 | - | 29.32 |
| duck | 6.80 | 5.07 | 31.8 | 12.1 | 53.46 |
| eggbox | - | - | 1.9 | - | 0.17 |
| glue | 4.70 | 6.53 | 13.8 | 25.9 | 23.48 |
| holepuncher | 2.40 | 8.26 | 23.1 | 20.6 | 72.98 |
| average | 7.60 | 6.16 | 17.2 | 20.8 | 38.59 |

AD{DI}-10%

| Object | Tekin [23] | PoseCNN [26] | Ours (with BC) |
|---|---|---|---|
| ape | 2.48 | 9.6 | 21.97 |
| can | 17.48 | 45.2 | 24.52 |
| cat | 0.67 | 0.93 | 9.77 |
| driller | 7.66 | 41.4 | 26.11 |
| duck | 1.14 | 19.6 | 23.62 |
| eggbox | - | 22 | 20.43 |
| glue | 10.08 | 38.5 | 28.02 |
| holepuncher | 5.45 | 22.1 | 41.4 |
| average | 6.42 | 24.9 | 24.48 |

Table 3: The comparison of accuracies of our method and the baseline methods on the Occlusion LINEMOD dataset. BC refers to bias correction using error modes from train data.

4.4 Ablation Studies

We conduct ablation studies to evaluate the effectiveness of CullNet in comparison to other potential methods for the culling process on the LINEMOD dataset in Table 2 (a) and Table 2 (b). Two such candidate methods are the arg-max selection of the most confident pose proposal and using RANSAC on the top-$k$ most confident pose proposals.

We evaluate CullNet on top of multiple keypoint proposal networks, namely YOLOv2 and YOLOv3. Our method comfortably outperforms argmax based selection of the most confident pose proposal for both keypoint proposal networks, as shown in Table 2 (a). This clearly reflects the problem of un-calibrated confidence scores in argmax based selection for both YOLOv2 and YOLOv3. We also show pose accuracies for all classes of the LINEMOD dataset at varying reprojection error thresholds of the 2D-Reprojection metric in Figure 3. These results reinforce the effectiveness of CullNet in improving the final pose estimates over a range of reprojection error thresholds.

We also show how robust our method is to variations in the number of most confident pose proposals chosen for the culling process in Table 2 (b). CullNet is shown to be extremely stable to a large number of pose proposals, whereas the accuracy of RANSAC starts degrading as $k$ grows. This is related to the fact that our method can differentiate between falsely detected object regions and correct object regions. This property specifically helps in cases where increasing $k$ introduces false object proposals, such as a yellow cup instead of the yellow duck.

4.5 Occlusion LINEMOD Dataset

Though this work does not attempt to address the problem of partial occlusions in RGB-based object pose estimation, it is interesting to see how our approach behaves on such hard examples after training only on fully unoccluded pose instances. For this, we evaluate our approach on the Occlusion LINEMOD dataset [1]. This dataset was created by annotating 8 objects in a sequence of 1215 frames from the LINEMOD dataset and contains challenging cases of severe partial occlusion. We use the same trained models for evaluation on the Occlusion LINEMOD dataset as we use for the LINEMOD dataset.

We show comparisons with state-of-the-art RGB-based pose estimation methods on the Occlusion LINEMOD dataset in Table 3. Our approach outperforms most of the state-of-the-art methods by a large margin on the 2D-Reprojection metric. It also performs comparably to the state of the art on the AD{D|I} metric. This is an interesting result considering that we do not use any occluded examples during training.
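For readers unfamiliar with the 3D metric, the sketch below follows the common definitions of ADD (average distance between corresponding model points) and ADI (average distance to the closest point, used for symmetric objects). The function names are hypothetical, and the acceptance threshold of 10% of the model diameter is the widely used convention, assumed here rather than taken from this section.

```python
import numpy as np

def add_error(points, R_est, t_est, R_gt, t_gt):
    """ADD: mean distance between corresponding model points under the two poses."""
    est = points @ R_est.T + t_est
    gt = points @ R_gt.T + t_gt
    return float(np.mean(np.linalg.norm(est - gt, axis=1)))

def adi_error(points, R_est, t_est, R_gt, t_gt):
    """ADI: mean distance from each ground-truth point to its nearest estimated point."""
    est = points @ R_est.T + t_est
    gt = points @ R_gt.T + t_gt
    pairwise = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return float(np.mean(pairwise.min(axis=1)))

def is_correct(error, diameter, fraction=0.1):
    """Assumed convention: a pose counts as correct if the error is below 10% of the diameter."""
    return error < fraction * diameter

# A four-fold symmetric toy object rotated by 90 degrees about its symmetry axis.
square = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]])
rot_z_90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
zero = np.zeros(3)

add_value = add_error(square, rot_z_90, zero, np.eye(3), zero)  # sqrt(2): penalizes the rotation
adi_value = adi_error(square, rot_z_90, zero, np.eye(3), zero)  # 0: the point sets coincide
```

The toy example shows why ADI is preferred for symmetric objects: the rotated pose is visually indistinguishable from the ground truth, yet ADD assigns it a large error.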

5 Conclusion

We have introduced a new object pose estimation pipeline based on RGB images only. The pipeline consists of a keypoint proposal network that produces several object pose proposals and a new culling mechanism that selects the best final pose estimate. We present detailed experiments on two challenging benchmark datasets, on which our pipeline outperforms state-of-the-art methods. We also show the superiority of our approach over RANSAC and other culling strategies in terms of pose accuracy and robustness to variations in the number of pose proposals.

Supplementary Material

CullNet: Calibrated and Pose Aware Confidence Scores for Object Pose Estimation

In the supplementary material, we show confidence plots that illustrate the calibration effect of CullNet on all classes of the LINEMOD [5] dataset. We also show the network architecture of our keypoint proposal network, i.e., YOLOv3 [20], in Figure 4.

Figure 4: Network architecture of Keypoint proposal network (YOLOv3 [20]). denotes the number of outputs from each grid cell.
Figure 5: Comparisons of pose proposal confidence output of Keypoint proposal network and CullNet for different classes of LINEMOD dataset.

Confidence Plots

We show comparisons of the pose proposal confidence output of the keypoint proposal network and CullNet for all classes of the LINEMOD dataset in Figure 5 (for the 'Duck' class, these plots are shown in the main paper).


References

  • [1] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother (2014) Learning 6d object pose estimation using 3d object coordinates. In European conference on computer vision, pp. 536–551.
  • [2] E. Brachmann, F. Michel, A. Krull, M. Ying Yang, S. Gumhold, et al. (2016) Uncertainty-driven 6d pose estimation of objects and scenes from a single rgb image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3364–3372.
  • [3] T. Do, M. Cai, T. Pham, and I. Reid (2018) Real-time monocular object instance 6d pose estimation. In BMVC 2018.
  • [4] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338.
  • [5] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab (2012) Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Asian conference on computer vision, pp. 548–562.
  • [6] S. Ioffe (2017) Batch renormalization: towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems, pp. 1945–1953.
  • [7] O. H. Jafari, S. K. Mustikovela, K. Pertsch, E. Brachmann, and C. Rother (2017) IPose: instance-aware 6d pose estimation of partly occluded objects. arXiv preprint arXiv:1712.01924.
  • [8] W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab (2017) SSD-6d: making rgb-based 3d detection and 6d pose estimation great again. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 1530–1538.
  • [9] A. Kendall, M. Grimes, and R. Cipolla (2015) Posenet: a convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pp. 2938–2946.
  • [10] V. Lepetit, F. Moreno-Noguer, and P. Fua (2009) Epnp: an accurate O(n) solution to the pnp problem. International journal of computer vision 81 (2), pp. 155.
  • [11] Y. Li, G. Wang, X. Ji, Y. Xiang, and D. Fox (2018) Deepim: deep iterative matching for 6d pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 683–698.
  • [12] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
  • [13] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37.
  • [14] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110.
  • [15] F. Manhardt, W. Kehl, N. Navab, and F. Tombari (2018) Deep model-based 6d pose refinement in rgb. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 800–815.
  • [16] M. Oberweger, M. Rad, and V. Lepetit (2018) Making deep heatmaps robust to partial occlusions for 3d object pose estimation. arXiv preprint arXiv:1804.03959.
  • [17] S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao (2019) PVNet: pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4561–4570.
  • [18] M. Rad and V. Lepetit (2017) BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In International Conference on Computer Vision, Vol. 1, pp. 5.
  • [19] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. arXiv preprint.
  • [20] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767.
  • [21] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99.
  • [22] F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce (2006) 3d object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. International Journal of Computer Vision 66 (3), pp. 231–259.
  • [23] B. Tekin, S. N. Sinha, and P. Fua (2017) Real-time seamless single shot 6d object pose prediction. arXiv preprint arXiv:1711.08848.
  • [24] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg (2008) Pose tracking from natural features on mobile phones. In Proceedings of the 7th IEEE/ACM international symposium on mixed and augmented reality, pp. 125–134.
  • [25] Y. Wu and K. He (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.
  • [26] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox (2017) Posecnn: a convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199.
  • [27] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba (2010) Sun database: large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492.