Explaining the Ambiguity of Object Detection and 6D Pose from Visual Data

12/01/2018 ∙ by Fabian Manhardt, et al. ∙ 1

3D object detection and pose estimation from a single image are two inherently ambiguous problems. Oftentimes, objects appear similar from different viewpoints due to shape symmetries, occlusion and repetitive textures. This ambiguity in both detection and pose estimation means that an object instance can be perfectly described by several different poses and even classes. In this work we propose to explicitly deal with this uncertainty. For each object instance we predict multiple pose and class outcomes to estimate the specific pose distribution generated by symmetries and repetitive textures. The distribution collapses to a single outcome when the visual appearance uniquely identifies just one valid pose. We show the benefits of our approach which provides not only a better explanation for pose ambiguity, but also a higher accuracy in terms of pose estimation.




1 Introduction

Driven by deep learning, image-based object detection has recently made a tremendous leap forward in both accuracy and efficiency [38, 13, 30, 37]. An emerging research direction in this field is the estimation of the object’s pose in 3D space over the existing 6 degrees of freedom (DoF) rather than on the 2D image plane [23, 46, 8]. This is motivated by a strong interest in achieving robust and accurate monocular 6D pose estimation for applications in robotic grasping, scene understanding and augmented/mixed reality, where the use of a 3D sensor is not feasible [33, 27, 49, 44].

Nevertheless, 6D pose estimation from RGB is a challenging problem due to the intrinsic ambiguity caused by the visual appearance of objects under different viewpoints and occlusion. Indeed, most common objects exhibit shape ambiguities and repetitive patterns that cause their appearance to be very similar under different viewpoints, thus rendering pose estimation a problem with multiple correct solutions. Furthermore, occlusion (from the same object or from others) can also cause pose ambiguity.

Figure 1: Ambiguity along an arc. The left and mid viewpoints yield the same object visual appearance, as well as all views among them (continuous arc). Only when the handle is visible (right frame, dotted arc), the pose becomes unambiguous.

For example, as illustrated in Figure 1, the cup is identical from every viewpoint along the continuous grey arc. Thus, from a single image, it is impossible to uniquely estimate the current object pose. Moreover, object symmetry can also induce visual ambiguities leading to multiple poses with the same visual appearance. However, most datasets do not reflect this ambiguity, since ground truth pose annotations are mostly uniquely defined at each frame. This is problematic for a proper optimization of the rotation, since a visually correct pose can still result in a high loss. Therefore, many recent 3D detectors avoid directly regressing the rotation and, instead, explicitly model the solution space in an unambiguous fashion [35, 23].

Figure 2: Our method provides multiple hypotheses for the pose, in order to approximate the distribution in the solution space. We predict pose hypotheses that are visually identical but rotated differently around the object’s symmetry axis.

Essentially, in [23], the authors train their convolutional neural network (CNN) by mapping all possible pose solutions for a certain viewpoint onto an unambiguous arc on the view sphere. Rad et al. [35] employ a separate CNN solely trained to classify the symmetry in order to resolve these ambiguities. However, this simplification exhibits several downsides, such as the explicit inclusion of information about certain symmetries in each trained object. Moreover, this is not always easy to model, e.g. in the case of partial view ambiguity. Further, all these approaches rely on prior knowledge and annotation of the object symmetries and aim to solve the ambiguity by providing a single outcome in terms of estimated pose and object. In addition, these methods are unable to deal with ambiguities generated by other common factors such as occlusion.

In contrast, Sundermeyer et al. [41] recently proposed a novel method to conduct pose estimation in an ambiguity-free manner. To this end, they first utilize SSD for localization. Subsequently, they employ an augmented auto-encoder on each 2D detection and conduct a nearest-neighbor (NN) lookup in the feature space to retrieve the detection’s rotation. Hence, they can implicitly deal with ambiguities since views with identical visual appearance are mapped to the same location in the latent space. Although the method is able to deal with ambiguities implicitly, it does not model their detection and description explicitly.

In this paper we propose to model the ambiguity of the object detection and pose estimation tasks directly by allowing our learned model to predict multiple solutions, or hypotheses, for a given object’s visual appearance. Inspired by Rupprecht et al. [39], we propose a novel architecture and loss function for monocular 6D pose estimation by means of multiple predictions. Essentially, each predicted hypothesis is a 6D pose and object class. When the visual appearance is ambiguous, the model predicts a point estimate of the distribution in 3D pose space. Conversely, when the object’s appearance is unique, the hypotheses will collapse into the same solution. Importantly, our model is capable of learning the distribution of these 6D hypotheses from one single ground truth pose per sample, without further supervision.

Besides providing more insight and a better explanation for the current task, the additionally gained knowledge can be exploited to improve the accuracy of the pose estimation. In essence, analyzing the distribution of the hypotheses enables us to classify if the current perceived viewpoint is ambiguous and to compute the axis of ambiguity for the specific object and viewpoint. Subsequently, when ambiguity is detected, we can employ mean shift [7] clustering over the hypotheses in quaternion space to find the main modes for the current pose. A robust averaging in 3D rotation space for each mode then yields a highly accurate pose estimation. When the view is ambiguity-free, we can improve our pose estimates by robustly averaging over all 6D hypotheses, and by taking advantage of the predicted pose distribution as a confidence measure.

2 Related Work

We first review recent work in object detection and pose estimation for 2D and 3D data. Afterwards, we discuss common grounds and main differences with approaches aimed at symmetry detection for 3D shapes.

Object Detection and Pose Estimation.

Methods for object detection and pose estimation can be subdivided into two categories, i.e. RGB-D and RGB based methods. While the former is dominating this sector, the latter is currently gaining interest.

Hinterstoisser et al. [16] proposed a novel image representation for template matching using image gradient orientations to robustly differentiate between different viewpoints. By using hashing to retrieve candidate templates, [5, 25, 20] achieve a sub-linear matching complexity with respect to the number of objects.

Further recent research focuses on learning-based methods. For instance, [3, 45] employ random forests to detect objects in the scene and estimate the associated 6D pose. Naturally, deep learning methods have also been proposed recently. For example, [48, 24] employ CNNs to learn an embedding space for the pose and class from RGB-D data, which can subsequently be utilized for retrieval. Notably, the majority of the most recent deep learning based methods focus on RGB as input [23, 35, 8, 46, 50, 41]. Since utilizing pre-trained networks often accelerates convergence and leads to better local minima, these methods are usually built on state-of-the-art backbones for 2D object detection, such as Inception [43] or ResNet [13]. In particular, Kehl et al. [23] employ SSD [30] with an InceptionV4 [42] backbone and extend it to also classify viewpoint and in-plane rotation. Similarly, Sundermeyer et al. [41] also use SSD for localization, but employ an augmented auto-encoder for the unambiguous retrieval of the associated 6D pose. Rad et al. [35] utilize VGG [40] and augment it to provide the 2D projections of the 3D bounding box corners. A similar approach is chosen by [46], based on YOLO [37]. Afterwards, both apply PnP to fit the associated 3D bounding box to the regressed 2D projections, in order to estimate the 3D pose of the detection. In [50], Xiang et al. compute a shared feature embedding for subsequent object instance segmentation paired with pose estimation. Finally, Do et al. [8] extend Mask-RCNN [12] by a third pose branch, which additionally provides the 3D rotation and the distance to the camera for each prediction. Afterwards, they compute the center of the 2D bounding box and back-project it to 3D, using the regressed depth value.

Object Symmetry Detection

Oftentimes, object pose ambiguity arises from symmetric shapes. We review relevant methods that aim at extracting symmetry from 3D models to outline commonalities and differences with our approach.

Most methods for symmetry detection are found in the shape analysis community. Among the different kinds of symmetries, axial symmetries are of particular interest, and multiple approaches have been proposed. Most approaches rely on feature matching or spectral analysis: [9] treat the problem as a correspondence matching task between a series of keypoints on an object, determining the reflection symmetry hyperplane as an optimization problem. Elawady et al. [10] rely on edge features extracted using a Log-Gabor filter in different scales and orientations coupled with a voting procedure on the computed histogram of local texture and color information. In addition, [6] and [32] are also grounded on wavelet-based approaches. Recently, neural network approaches have also been proposed. Ke et al. [22] adapt an edge-detection architecture with multiple residual units and successfully apply it to symmetry detection using real-world images.

Notably, all these approaches aim at detecting symmetries of 3D shapes alone, while our focus is to model the ambiguity arising from objects under specific viewpoints with the goal of improving and explaining pose estimation.

3 Methodology

In this section, we describe our method of handling symmetries and other ambiguities for object detection and pose estimation in detail. We will first define what we understand as an ambiguity.

3.1 Ambiguity in Object Detection and Pose Estimation

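We describe the rigid body transformations $SE(3)$ via the semi-direct product of $SO(3)$ and $\mathbb{R}^3$. While for the latter we use Euclidean 3-vectors, the algebra $\mathbb{H}_1$ of unit quaternions is used to model the spatial rotations in $SO(3)$. A quaternion is given by

$$q = q_1 \mathbf{1} + q_2 \mathbf{i} + q_3 \mathbf{j} + q_4 \mathbf{k} = (q_1, q_2, q_3, q_4), \tag{1}$$

with $(q_1, q_2, q_3, q_4)^\top \in \mathbb{R}^4$ and $\mathbf{i}^2 = \mathbf{j}^2 = \mathbf{k}^2 = \mathbf{i}\mathbf{j}\mathbf{k} = -1$. To this end, we regress quaternions above the $q_1 = 0$ hyperplane and thus omit the southern hemisphere, such that any possible 3D rotation can be expressed by precisely one single quaternion.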

In essence, a direct naive regression of the rotation as a quaternion will lead to poor results, as the network will learn to predict a rotation that is closest to all results under symmetries. This prediction can be seen as the (conditional) mean rotation. More formally, in a typical supervised setting we associate images $I_i$ with poses $p_i$ in a dataset $(I_i, p_i)$ where $i \in \{1, \dots, N\}$. To describe symmetries, we define for a given image $I_i$ the set $\mathcal{S}(I_i)$ of poses that all have an identical image

$$\mathcal{S}(I_i) = \{\, p_j \mid I_j = I_i \,\}. \tag{2}$$

Note that in the case of non-discrete symmetries the set $\mathcal{S}$ will contain infinitely many poses, which in turn transforms the sums over $\mathcal{S}$ in the following into integrals. For the sake of a simpler notation and a finite training set in practice, we chose to continue with the notion of a finite $\mathcal{S}$. The naive model $f_\theta(I)$, which directly regresses a pose $p'$ from $I$, optimizes a loss $\mathcal{L}(p, p')$ by minimizing

$$\theta^* = \arg\min_{\theta} \sum_{i=1}^{N} \mathcal{L}\big(p_i, f_\theta(I_i)\big) \tag{3}$$

over the training set. However, due to the symmetries, the mapping from $I$ to $p$ is not well defined and cannot be modeled as a function. By minimizing Equation 3, $f$ is learned to predict a pose $\tilde{p}$ approximating all possible poses for this image equally well:

$$f_{\theta^*}(I) = \tilde{p} = \arg\min_{p'} \sum_{p_i \in \mathcal{S}(I)} \mathcal{L}(p_i, p'). \tag{4}$$

This is an unfavorable result since $\tilde{p}$ is chosen to minimize the sum of all losses towards the different symmetries. In the following section, we will describe how we model these ambiguities inside our method.
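
The collapse to a mean-like prediction can be seen in a deliberately simplified toy setting. The sketch below is an illustration only: it replaces the 6D pose by a single scalar, assumes two equally valid targets (-1 and +1) for identical-looking inputs, and uses a squared loss. None of these choices are taken from the method itself.

```python
def summed_loss(prediction, targets):
    """Sum of squared errors of one prediction against all valid targets."""
    return sum((prediction - t) ** 2 for t in targets)


def best_single_prediction(targets, candidates):
    """Candidate that minimizes the summed loss over all valid targets."""
    return min(candidates, key=lambda c: summed_loss(c, targets))


# Toy stand-in for two poses that produce the same image (hypothetical values).
valid_targets = [-1.0, 1.0]
candidates = [i / 100.0 for i in range(-150, 151)]
mean_like = best_single_prediction(valid_targets, candidates)
```

In this toy case the minimizer lies halfway between the two valid answers and coincides with neither of them, which is the behavior a single-output regressor is pushed towards under symmetries.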

3.1.1 Multiple Pose Hypotheses

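The key idea behind the proposed method is to model the ambiguity by allowing multiple pose predictions from the network. In order to predict $M$ pose hypotheses from $f$, we extend the notation to $f_\theta(I) = \big(f_\theta^{(1)}(I), \dots, f_\theta^{(M)}(I)\big)$, where $f$ now returns $M$ pose hypotheses for each image $I$.

For training, the idea is not to punish all hypotheses given the current pose annotation, since they might be correct under ambiguities. Thus, we use a loss that optimizes only one of the hypotheses for each annotation. The most intuitive choice is to pick the closest one. We adapt the meta loss $\mathcal{M}$ from [39] that operates on $f$,

$$\theta^* = \arg\min_{\theta} \sum_{i=1}^{N} \mathcal{M}\big(p_i, f_\theta(I_i)\big), \tag{5}$$

while we use the original pose loss $\mathcal{L}$ for each $f_\theta^{(j)}$

$$\mathcal{M}\big(p, f_\theta(I)\big) = \min_{j \in \{1, \dots, M\}} \mathcal{L}\big(p, f_\theta^{(j)}(I)\big). \tag{6}$$

However, the hard selection of the minimum in Equation 6 does not work in practice as some of the hypothesis functions $f_\theta^{(j)}$ might never be updated if they are initialized far from the target values. Thus, we relax $\mathcal{M}$ to $\hat{\mathcal{M}}$ by adding the average error for all hypotheses with an epsilon weight

$$\hat{\mathcal{M}}\big(p, f_\theta(I)\big) = \Big(1 - \epsilon \frac{M}{M-1}\Big) \min_{j} \mathcal{L}\big(p, f_\theta^{(j)}(I)\big) + \frac{\epsilon}{M-1} \sum_{j=1}^{M} \mathcal{L}\big(p, f_\theta^{(j)}(I)\big). \tag{7}$$

The normalization constants before the two components are designed to give a weight of $(1 - \epsilon)$ to the original formulation $\mathcal{M}$ and $\epsilon$ to the gradient distributed over all other hypotheses. When $\epsilon \to 0$, then $\hat{\mathcal{M}} \to \mathcal{M}$. This is necessary since the average in the second term already contains the minimum from the first term.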
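
As a minimal sketch of the relaxation described above, the following function combines the minimum with the sum over all hypotheses so that the closest hypothesis receives a total weight of 1 - eps and each remaining hypothesis a weight of eps / (M - 1). The function names and the plain-list interface are assumptions made for illustration; the per-hypothesis losses are taken as already computed.

```python
def relaxed_meta_loss(losses, eps):
    """Relaxed winner-takes-all loss over M >= 2 per-hypothesis losses."""
    m = len(losses)
    if m < 2:
        raise ValueError("the relaxation needs at least two hypotheses")
    best = min(losses)
    # The sum already contains the minimum, hence the eps * M / (M - 1) term.
    return (1.0 - eps * m / (m - 1)) * best + eps / (m - 1) * sum(losses)


def effective_weights(losses, eps):
    """Weight each hypothesis effectively receives in relaxed_meta_loss."""
    m = len(losses)
    winner = losses.index(min(losses))
    return [1.0 - eps if j == winner else eps / (m - 1) for j in range(m)]
```

With eps set to 0 the function reduces to the hard minimum, and for any eps the effective weights sum to one, which matches the normalization argument in the text.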

3.2 Architecture

We employ SSD-300 with an extended InceptionV4 [42] backbone and adjust it to also provide the 6D pose along with each detection. In particular, we append two more ’Reduction-B’ blocks to the backbone. Essentially, we branch off after each dimensionality reduction block and place in total anchor boxes to cover objects at different scales. Moreover, to include the unambiguous regression of the 6D pose, we modify the prediction kernel such that it provides outputs for each anchor box. Thereby, $C$ denotes the number of classes, $M$ denotes the number of hypotheses, and $P$ denotes the number of parameters to describe the 6D pose. In our case, for each of the $M$ predicted hypotheses, we regress $P = 5$ values to characterize the 6D pose, composed of a 4D quaternion for the 3D rotation and the object’s distance towards the camera. We can estimate the remaining two degrees of freedom by back-projecting the center of the 2D bounding box using the inferred depth.
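
The back-projection step can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: it assumes a pinhole camera without distortion, treats the regressed distance as the depth along the optical axis, and the intrinsics `fx`, `fy`, `cx`, `cy` as well as the box layout are hypothetical names introduced here.

```python
def backproject_box_center(box, depth, fx, fy, cx, cy):
    """Recover a 3D translation from a 2D box and a depth value.

    box is (x_min, y_min, x_max, y_max) in pixels; depth is measured
    along the optical axis in the unit the translation should have.
    """
    u = 0.5 * (box[0] + box[2])  # box center, horizontal pixel coordinate
    v = 0.5 * (box[1] + box[3])  # box center, vertical pixel coordinate
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)
```

A box centered on the principal point yields a translation straight along the optical axis; any offset from it scales linearly with the depth.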

Finally, given a set of positive boxes Pos and hard-mined negative boxes Neg for a training image, we minimize the following energy function:


For the class and the refinement of the anchor boxes, we employ the cross-entropy loss and the smooth L1-norm , respectively. In order to compare the similarity of two quaternions, we compute the angle between the estimated rotation and the ground truth rotation according to


Additionally, we employ the smooth L1-norm as loss for the depth component


Altogether, we define the final loss for each hypothesis and input image as follows


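The two building blocks named above, an angle between two quaternions and a smooth L1 penalty, can be sketched as follows. The exact expressions and weights of the loss are not reproduced in this text, so the code uses standard formulations as assumptions: the rotation angle 2·arccos(|⟨q, q'⟩|) between unit quaternions, and the smooth L1 function with a transition point at 1.

```python
import math


def quaternion_angle(q, r):
    """Rotation angle in radians between the rotations encoded by q and r."""
    dot = sum(a * b for a, b in zip(q, r))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_r = math.sqrt(sum(b * b for b in r))
    # abs() makes q and -q equivalent; min() guards against rounding above 1.
    c = min(1.0, abs(dot) / (norm_q * norm_r))
    return 2.0 * math.acos(c)


def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5
```

Because the angle depends only on the absolute value of the inner product, a quaternion and its negation are treated as the same rotation.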
3.3 Processing Multiple Hypotheses

During inference we further analyze the predicted multiple hypotheses in order to determine whether the pose of the object is ambiguous. Additionally, in case we detect an ambiguity, we can also exploit the multiple hypotheses to estimate the view-dependent axes of ambiguity.

Detection of Visual Ambiguities in Scenes.

We analyze the distribution of the predicted hypotheses in quaternion space to determine whether the pose exhibits an ambiguity. To this end, Principal Component Analysis (PCA) is performed on the quaternion hypotheses

. The singular-value decomposition of the data matrix indicates the ambiguity. In particular, if the dominant singular value

(), an ambiguity in the pose prediction is likely, while small singular values imply a collapse to one single unambiguous solution.

We determine the existence and type of ambiguity by thresholding the value of . Empirically, we find the criterion to offer good estimates of ambiguity, and the ratio to reveal the ambiguity type in a robust way. When , the ambiguity covers the entire axis; otherwise, it is constrained to an arc. It is noteworthy that we can learn to detect ambiguities without further supervision, directly from standard datasets.
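
A minimal sketch of the spread test on the hypotheses is given below. It computes only the dominant singular value of the mean-centered hypothesis matrix, using power iteration on the 4x4 scatter matrix so that no external library is needed. The scalar-first quaternion layout, the function names and the `threshold` argument are assumptions; the threshold used in the experiments is not restated here and must be supplied by the caller.

```python
import math


def to_upper_hemisphere(q):
    """Flip q so that its scalar part (assumed first) is non-negative."""
    return tuple(-x for x in q) if q[0] < 0.0 else tuple(q)


def dominant_singular_value(quats, iters=200):
    """Largest singular value of the mean-centered hypothesis matrix."""
    qs = [to_upper_hemisphere(q) for q in quats]
    mean = [sum(q[k] for q in qs) / len(qs) for k in range(4)]
    x = [[q[k] - mean[k] for k in range(4)] for q in qs]
    # 4x4 scatter matrix X^T X; its eigenvalues are the squared singular values.
    c = [[sum(row[a] * row[b] for row in x) for b in range(4)] for a in range(4)]
    v = [1.0, 0.7, 0.4, 0.1]  # arbitrary start vector for the power iteration
    lam = 0.0
    for _ in range(iters):
        w = [sum(c[a][b] * v[b] for b in range(4)) for a in range(4)]
        norm = math.sqrt(sum(t * t for t in w))
        if norm < 1e-15:
            return 0.0  # all hypotheses coincide
        v = [t / norm for t in w]
        lam = norm
    return math.sqrt(lam)


def is_ambiguous(quats, threshold):
    """True if the hypotheses spread more than the caller-supplied threshold."""
    return dominant_singular_value(quats) > threshold
```

Identical hypotheses give a value of zero, while hypotheses spread along an arc give a clearly positive value; distinguishing the type of ambiguity would additionally require the remaining singular values.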

Estimation of the Axis of Ambiguity.

As mentioned, very prominent representatives for visual ambiguities are symmetries in the objects of interest, as illustrated in Fig. 3 (left) and (mid). Nevertheless, for other objects such as cups, (self-)occlusion can also induce ambiguities in appearance, as demonstrated in Fig. 3 (right).

Figure 3: Multiple instances of pose ambiguity. Left: Ambiguity around a rotation axis. Mid: Two different possible poses for each side. Right: Ambiguity constrained to an arc

In order to calculate a viewpoint-dependent ambiguity axis, we take a closer look at the scenario. A rotation rotates the camera to around the rotation axis


All these rotation axes lie in the same plane which is perpendicular to the ambiguity axis . Thus, if we stack the rotation axes , we can formulate the overdetermined linear equation system . The ambiguity axis can be found as the solution to the optimization problem


which we solve for

to be more robust against outliers than with the L2-norm.
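
The axis estimation can be sketched as a small robust null-space problem: given the stacked rotation axes, find the unit vector that is as perpendicular as possible to all of them. The solver is not specified in this text, so the sketch below uses iteratively reweighted least squares (IRLS) as one common way to approximate an L1 objective; this is a swapped-in choice, and the function names, iteration counts and `delta` are assumptions. The rotation axes are taken as already extracted unit vectors.

```python
import math


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def _smallest_eigvec(s, iters=100):
    """Eigenvector of the smallest eigenvalue of a symmetric PSD 3x3 matrix.

    Runs power iteration on the adjugate, whose dominant eigenvector is the
    eigenvector of s with the smallest eigenvalue.
    """
    cols = [_cross(s[1], s[2]), _cross(s[2], s[0]), _cross(s[0], s[1])]
    v = [0.3, 0.5, 0.8]  # arbitrary start vector
    for _ in range(iters):
        w = [sum(cols[k][i] * v[k] for k in range(3)) for i in range(3)]
        n = math.sqrt(sum(t * t for t in w))
        if n == 0.0:
            break
        v = [t / n for t in w]
    n = math.sqrt(sum(t * t for t in v))
    return [t / n for t in v]


def ambiguity_axis(axes, irls_steps=10, delta=1e-6):
    """Unit vector a approximately minimizing sum_i |v_i . a| over the axes v_i."""
    weights = [1.0] * len(axes)
    a = None
    for _ in range(irls_steps):
        s = [[sum(w * v[i] * v[j] for w, v in zip(weights, axes))
              for j in range(3)] for i in range(3)]
        a = _smallest_eigvec(s)
        # Small residuals get large weights, which approximates the L1 norm.
        weights = [1.0 / max(abs(sum(x * y for x, y in zip(v, a))), delta)
                   for v in axes]
    return a
```

The returned axis is defined only up to sign. With all rotation axes lying in one plane, the result is the plane normal, and a single axis outside the plane pulls the estimate less than it would under an L2 objective.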

3.4 From Multiple Hypotheses to 6D Pose

After analyzing the distribution of the hypotheses, we can robustly compute the associated 6D pose for each case.

Unambiguous Object Pose.

In case of an unambiguous object pose, we utilize the multiple hypotheses as an input for a geometric median (geodesic $L_1$-mean [11]) to robustify the overall estimation


The iterative calculation follows the Weiszfeld algorithm [47] in the tangent spaces to the quaternion hypersphere [4]. From a statistical perspective, our rotation measures are treated as inputs for an M-estimator to robustly detect the geometric median with respect to the geodesic distance on the quaternion hypersphere. In addition, we compute the median depth of all hypotheses. Afterwards, we utilize the center of the 2D detection and back-project it into 3D to obtain the translation and thereby the full 6D pose of the detection.
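
A compact sketch of a Weiszfeld-style iteration on the unit quaternion sphere is shown below. It is meant as an illustration of the idea rather than the exact procedure of [11, 4]: the initialization with the normalized arithmetic mean, the weight clamp `eps` and the iteration count are assumptions introduced here.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(q):
    n = math.sqrt(_dot(q, q))
    return [x / n for x in q]


def _log_map(mu, q):
    """Tangent vector at mu pointing towards q, and its geodesic length."""
    d = _dot(mu, q)
    if d < 0.0:  # q and -q encode the same rotation
        q = [-x for x in q]
        d = -d
    perp = [x - d * m for x, m in zip(q, mu)]
    n = math.sqrt(_dot(perp, perp))
    if n < 1e-12:
        return [0.0] * 4, 0.0
    theta = math.atan2(n, d)
    return [theta * x / n for x in perp], theta


def _exp_map(mu, t):
    """Move from mu along the tangent vector t and stay on the unit sphere."""
    n = math.sqrt(_dot(t, t))
    if n < 1e-12:
        return list(mu)
    return _normalize([math.cos(n) * m + math.sin(n) * x / n
                       for m, x in zip(mu, t)])


def geodesic_median(quats, iters=100, tol=1e-10, eps=1e-6):
    """Approximate geometric median of unit quaternions (Weiszfeld-style)."""
    ref = quats[0]
    aligned = [list(q) if _dot(q, ref) >= 0.0 else [-x for x in q] for q in quats]
    mu = _normalize([sum(q[k] for q in aligned) for k in range(4)])
    for _ in range(iters):
        num, den = [0.0] * 4, 0.0
        for q in quats:
            t, theta = _log_map(mu, q)
            w = 1.0 / max(theta, eps)  # inverse-distance weight, clamped
            num = [a + w * b for a, b in zip(num, t)]
            den += w
        step = [a / den for a in num]
        mu = _exp_map(mu, step)
        if math.sqrt(_dot(step, step)) < tol:
            break
    return mu
```

In contrast to a plain average, a single strongly deviating hypothesis barely moves the result, which is the property the robust averaging relies on.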

Ambiguous Object Pose.

As the number of possible 3D rotations is finite yet unknown, we employ mean shift [7] to cluster the hypotheses in quaternion space. To this end, the angular distance of the quaternion vectors measures their similarity. This yields either one cluster (if the poses are connected) or multiple (if they are unconnected) as illustrated in Fig. 3. For each cluster we compute a median rotation and the median depth to retrieve the associated 3D translation. Note that we only consider the depths of the hypotheses that contributed to the corresponding cluster. We conduct simple contour checks in RGB in order to find the best fitting cluster, from which we extract the final 6D pose.

4 Evaluation

In this section, we first introduce our experimental setup. Afterwards, we demonstrate the robustness of distinguishing whether a view exhibits an ambiguity. Third, we report our 6D pose estimation accuracy for the unambiguous and the ambiguous case. Finally, we demonstrate how we can model uncertainty in pose estimation by analyzing the variance across hypotheses.

We build two datasets, one for each case. In particular, for the unambiguous dataset, we use the popular ‘Hinterstoisser’ dataset [14]. However, we moved the ‘glue’ and ‘eggbox’ objects to the ambiguous dataset, since both exhibit several views (mostly from the top) that are not unique. Additionally, following [23, 35] we removed the ‘cup’ and ‘bowl’ objects, because no watertight CAD models are provided for them. We also discard the ‘lamp’ since the CAD model does not possess correct normal vectors for proper rendering. To the ambiguous dataset, besides the ‘glue’ and ‘eggbox’ objects, we added several models from T-LESS [18] to cover different types of ambiguities. In essence, T-LESS mostly consists of symmetric and textureless industrial objects. For our experiments we choose a subset that covers both cases: complete rotational symmetry along an axis (object 4) and objects with more than one rotational symmetry (objects 5, 9 and 10).

4.1 Experimental Setup

As noted in [17], domain adaptation between synthetically generated data samples and real-world images simplifies the collection of training data, however at the cost of performance. Nevertheless, we render CAD models in random poses within a defined range of rotations, adding a series of augmentations, such as illumination changes, shadows and blur, as well as background images taken from the MS COCO dataset [28]. As already discussed, our method attempts to model the presence of rotational symmetries, thus it is not necessary to manually constrain the valid range of rotations per object to avoid ambiguities, which is a common work-around in recent pose estimation systems [23, 35].

We implemented our method in TensorFlow [1] v1.5 and conducted all our experiments on an Intel i7-5820K@3.3GHz CPU with an NVIDIA GTX 1080 GPU. For all experiments, we train with a batch size of 10 and use ADAM [26] with a learning rate of .

Implementation Details.

There are several hyper-parameters that influence the performance of our method. In particular, the relaxation weight $\epsilon$ influences the convergence to a reference ground truth. If the value is too small, only the hypothesis closest to the ground truth will be updated, causing poses to cluster around that particular point, whereas a large value will result in loose hypothesis predictions. To overcome this issue, at the beginning, we distribute comparably more loss such that all hypotheses are able to learn appropriate results, before we force them to either specialize or gather. To this end we interpolate $\epsilon$ from to during training.
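
Such an interpolation of a hyper-parameter over the course of training can be sketched with a simple linear schedule. The start and end values are not restated in this text, so they are left as arguments; the function name and the step-based interface are assumptions.

```python
def linear_schedule(step, total_steps, start, end):
    """Linearly interpolate from start to end over total_steps, then hold."""
    t = min(max(step / float(total_steps), 0.0), 1.0)
    return start + t * (end - start)
```

The same helper covers both directions: a decreasing relaxation weight is obtained with start greater than end, and an increasing loss weight with start smaller than end.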

The choice of and in (8) is also influential. In our experiments, we find that setting works well in general. In contrast, has to be chosen more carefully. Similar to , a linearly increasing value for works better in practice. Essentially, a large value of hinders the network from properly focusing on learning to robustly detect the object. Nevertheless, the loss for rotation tends to converge quickly when choosing too small. Consequently, we set at the beginning and linearly increase it to during training.

Lastly, the number of hypotheses also influences the accuracy of the method. A small number of hypotheses might not be able to convey inputs of high ambiguity and hinders clustering, whereas a large number results in redundant information. We validate all experiments with different numbers of hypotheses to understand its influence.

Figure 4: Qualitative results employing a yellow cup. The red frustums visualize a collection of pose hypotheses. The blue frustum constitutes the median, which is employed to render the 3D bounding box. The collapse of all hypotheses towards one signals that the task is unambiguous (upper row). However, partial symmetries and occlusion lead to multiple possible outcomes for the 6D pose (lower row). In addition, our method is capable of determining the ambiguity of the pose by analyzing the distribution of hypotheses (red lines).

Evaluation metrics.

In order to properly assess the 6D pose performance, we distinguish again between ambiguous and non-ambiguous objects. When dealing with non-ambiguous objects, we report the absolute error for the 3D rotation in degrees and 3D translation in millimeters. We also show our accuracy using the Average Distance of Distinguishable Model Points (ADD) metric from [16], which measures if the average deviation of the transformed model points is less than 10% of the object’s diameter.
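
The ADD criterion described above can be sketched as follows: the model points are transformed with the ground truth pose and with the estimated pose, and the mean distance between corresponding points is compared against 10% of the object diameter. The data layout (points as 3-element lists, rotations as 3x3 nested lists) and the function names are assumptions made for illustration.

```python
import math


def transform(points, rot, trans):
    """Apply a rigid transformation (3x3 rotation, 3-vector translation)."""
    return [[sum(rot[i][k] * p[k] for k in range(3)) + trans[i]
             for i in range(3)] for p in points]


def add_error(points, rot_gt, t_gt, rot_est, t_est):
    """Average distance between corresponding transformed model points."""
    gt = transform(points, rot_gt, t_gt)
    est = transform(points, rot_est, t_est)
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(g, e)))
             for g, e in zip(gt, est)]
    return sum(dists) / len(dists)


def add_correct(points, diameter, rot_gt, t_gt, rot_est, t_est, ratio=0.1):
    """True if the average distance is below ratio * object diameter."""
    return add_error(points, rot_gt, t_gt, rot_est, t_est) < ratio * diameter
```

Because ADD compares corresponding points, it penalizes visually identical poses of symmetric objects, which is why a different measure is reported for the ambiguous objects.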

We also report our results for the Visual Surface Similarity (VSS) metric. As [23, 31], we define VSS similar to the Visual Surface Discrepancy (VSD) [19], however, set . Hence, we measure the pixel-wise overlap of the rendered ground truth pose and the rendered prediction. We decide to employ this metric, since it can be applied to both, ambiguous and ambiguity-free objects.

4.2 Shape Ambiguity Analysis

For each image of the ambiguous dataset, we manually annotated whether the current object view exhibits ambiguity based on the visible object texture and shape. This ground truth is used to quantitatively assess the capability of our method to detect pose ambiguity. Additionally, we compute the ground truth symmetry axes for each object. It is important to note that we do not conduct object symmetry detection, instead, we describe the perceived pose ambiguity in terms of a symmetry axis. These annotations are only used for evaluation and not during training.

In Fig. 5 (top) we first show how robustly we can detect ambiguities. Additionally, for each detected ambiguity, we compute the average discrepancy of the computed symmetry axis with the annotated ground truth axis in degrees. In essence, for the ambiguity-free case, we can report an accuracy of more than 90%, while for the ambiguous case we can also state a high accuracy of approximately 83% correctly classified views. Furthermore, the mean axis only deviates by 16°, which shows that our formulation is able to precisely explain the perceived ambiguity.

On the bottom, we respectively show one sample of estimated ambiguity axes from ‘Hinterstoisser’ and ‘T-LESS’. For each detection, we draw the estimated axis in red, while the green line denotes the hand-annotated groundtruth axis.

Figure 5: Ambiguity detection accuracy and mean axis deviations (top) and qualitative results for symmetry axis (green line) estimation (bottom). Notice that one screw was classified to be unambiguous (i.e. no axis), because the ambiguity could be resolved through the texture.

4.3 6D Pose Estimation and 2D Object Detection

We evaluate the accuracy of our method in case of ambiguous and unambiguous pose estimation. Furthermore, Fig 4 shows one qualitative example for each case. While all hypotheses (red frustums) collapse to one pose for the unambiguous views, for the ambiguous views they spread around the calculated axis of ambiguity (red line).

Unambiguous Pose Estimation.

In Tab. 1 we report our results for the unambiguous subset. In particular, even for the single hypothesis case, our approach reduces the pose error of SSD-6D by approximately 35% in relative terms while also being more robust in terms of 2D detection. Compared with Sundermeyer et al. [41], we can report a relative improvement of more than 50% with respect to the ADD metric. In addition, our robust averaging over all hypotheses leads to another improvement of all metrics. Interestingly, for the unambiguous dataset it turns out that works best in terms of 6D pose and also 2D detection. Employing too many hypotheses can hinder convergence since some hypotheses tend to starve depending on the initialization, and when too few hypotheses are used, the model cannot benefit from the averaging in 6D.

In contrast, [35] and [46] achieve a slightly higher ADD score. This can mostly be credited to the fact that they train on real data in order to close the domain gap between real and synthetic data. More precisely, they use a subset of the test data and crop the objects of interest. Thereafter, they place them on top of a random background image from MS COCO. In contrast to these approaches, [23], [41] and our method all train on purely synthetic data, which allows training for any object, provided the CAD model is available. Generally, synthetic training is desirable, since it saves the time and money needed to annotate the data; however, it commonly also leads to a decrease in performance. Therefore, a lot of research is currently focusing on further closing this domain gap [2, 17, 34, 36], which this work contributes to.

Method        Rot. [°]   Trans. [mm]   VSS [%]   ADD [%]   F1
Ours          17.9       45.6          76.8      31.2      91.6
Ours          18.9       44.3          76.3      32.8      92.1
Ours          17.5       40.6          78.2      35.3      94.7
Ours          19.2       45.6          77.2      31.3      90.6
Ours          18.7       44.6          77.4      33.8      92.7
Ours          19.2       44.9          77.3      32.7      91.0
[41]          -          -             -         22.1      -
SSD-6D [23]   28.0       72.4          67.4      9.4       88.8
BB-8 [35]     -          -             -         45.9      -
Tekin [46]    -          -             -         47.9      -
Table 1: Pose errors on the ‘Hinterstoisser’ subset [16] of unambiguous objects. Top: evaluation of the impact of different numbers of hypotheses. Bottom: comparison with Sundermeyer et al. [41] and Kehl et al. [23] (synthetic training data) and with Rad et al. [35] and Tekin et al. [46] (real training data).

Ambiguous Pose Estimation.

Referring to Tab. 2, for the ambiguous ‘Hinterstoisser’ objects, we can report a VSS score of 79% and an ADI score of 54%, which is a relative improvement of approximately 13% and 145% compared to SSD-6D. Surprisingly, the multiple hypothesis detector overall achieves similar performance to the single hypothesis predictor for the 6D pose. However, for the 2D detection case, we are able to increase the accuracy from 79% to 94%. As stated, only a few views are not well defined for these objects. While investigating the results, we discovered that the single hypothesis predictor is not capable of understanding exactly these ambiguous views and tends to simply discard them. In contrast, the multiple hypotheses predictor is indeed able to understand these views and gives reliable pose predictions.

For the ambiguous T-LESS objects (Tab. 2), our multiple hypotheses approach surpasses the single hypothesis estimator for all objects. In essence, the single hypothesis estimator trained and evaluated under the same conditions is not able to capture the ambiguities in pose and, thus, is not able to produce equally accurate results. Only when the ambiguities could be resolved was the single hypothesis predictor able to compute precise poses. Sundermeyer et al. [41] exceed our model by 4% and 3% with respect to VSS and ADI, respectively. Since the colors of the CAD models differ widely from the real scenes, generalization to these objects is very difficult from RGB information only. However, Sundermeyer et al. make use of real data for training their 2D detector on T-LESS, which also has a direct impact on their 6D pose estimates. First, they infer depth from the aspect ratio of the 2D bounding boxes. Second, deducing the rotation in a two-stage fashion simplifies the overall problem, yet comes with a linear cost in runtime. In conclusion, we can state overall good performance while relying on synthetic training data only, whereas [41] require real data to enable training on T-LESS.

T-LESS (all scores in %)
Object   Scene     VSS multi   VSS single   VSS [41]   ADI multi   ADI single   ADI [41]
obj_04   5, 9      72.2        68.6         78.5       17.8        14.1         15.2
obj_05   2, 3, 4   84.6        82.8         88.8       65.3        48.3         76.3
obj_09   5, 11     81.8        79.1         86.5       66.8        54.5         77.3
obj_10   5, 11     82.8        78.5         82.3       39.0        29.4         31.9
mean     -         80.2        77.3         84.0       47.2        36.6         50.6

‘Hinterstoisser’ (all scores in %)
Object   VSS multi   VSS single   VSS [23]   ADI multi   ADI single   ADI [23]   F1 multi   F1 single   F1 [23]
eggbox   83.5        78.5         76.3       54.1        56.0         26.3       98.0       83.0        93.7
glue     74.4        74.0         65.1       54.0        58.7         17.6       90.3       74.0        76.8
mean     78.9        76.3         70.7       54.1        57.4         22.0       94.2       78.5        85.5
Table 2: Evaluation scores on the ambiguous dataset (top: T-LESS; bottom: ‘glue’ and ‘eggbox’ objects of ‘Hinterstoisser’). We compare our multiple hypotheses results (multi) and the same predictor trained to output a single hypothesis (single) with Sundermeyer et al. [41] and SSD-6D [23]. The authors of [41] kindly provided us with their 6D poses.

4.4 Employing Multiple Hypothesis as Measurement for Reliability

To the best of our knowledge, there is no prior work in which the confidence in class and detection as well as the confidence in the continuous pose estimate are modeled. Yet, this knowledge can considerably improve the overall robustness and accuracy. In our case, besides the additional information on the viewpoint ambiguity, we can utilize the different hypotheses as a measurement for the confidence in the unambiguous 6D pose. To quantify the effect of this, we report our test results on the unambiguous subset of ‘Hinterstoisser’ in Fig. 6 (top), where we compute a confidence measure via the standard deviation with respect to the Karcher mean


Naturally, a lower standard deviation indicates more accurate pose estimates. For instance, by only accepting poses whose standard deviation lies below the threshold of the third row in Fig. 6 (top), we improve every metric while rejecting only approximately 10% of all pose estimates. Specifically, the rotational error decreases by approximately 20% (i.e. from 19.2° to 15.5°). The translation error also decreases slightly, from 44.8mm to 43.0mm. Employing an even lower threshold for the standard deviation yields another significant improvement in pose (especially in rotation), at the cost of rejecting more estimates. The qualitative example at the bottom confirms these results: while the pose of the ‘driller’ with the lowest standard deviation is very accurate, the pose with the highest standard deviation is rather imprecise. We observed the same behavior for every object of the unambiguous dataset. In summary, Fig. 6 shows that the standard deviation of the pose hypotheses can be used as a robust measure of the reliability of the pose.

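| STD | Rot. [°] | Trans. [mm] | VSS [%] | ADD [%] | Rejects [%] |
| --- | --- | --- | --- | --- | --- |
| | 11.8 | 39.4 | 80.0 | 37.7 | 32.6 |
| | 13.8 | 41.3 | 79.1 | 35.5 | 18.2 |
| | 15.5 | 43.0 | 78.3 | 34.3 | 10.5 |
| | 17.3 | 44.0 | 77.7 | 33.4 | 4.0 |
| | 19.2 | 44.8 | 77.3 | 32.7 | 0.0 |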
Figure 6: Top: We report the results for different bins for the standard deviation over all hypotheses for the poses. Bottom: One qualitative example referring to the unambiguous ‘driller’ object. We show the pose with the lowest (left) and the highest (right) standard deviation in the hypotheses. Thereby, the blue bounding box depicts the ground truth pose, the red bounding box the predicted pose and the red frustums illustrate the hypotheses.
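
The reliability measure described in this section can be illustrated with a minimal sketch. It assumes that each hypothesis is a unit quaternion in (w, x, y, z) order, that the spread is the root-mean-square geodesic angle to the Karcher mean, and that the mean is found by the standard iterative tangent-space averaging. The helper names and the threshold argument are hypothetical, and the exact formulation used in our experiments may differ.

```python
import math

def q_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def q_conj(q):
    """Conjugate, which equals the inverse for unit quaternions."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def q_log(q):
    """Rotation vector (axis * angle, in radians) of a unit quaternion."""
    w, x, y, z = q
    if w < 0.0:  # q and -q encode the same rotation; take the short arc
        w, x, y, z = -w, -x, -y, -z
    n = math.sqrt(x * x + y * y + z * z)
    if n < 1e-12:
        return (0.0, 0.0, 0.0)
    angle = 2.0 * math.atan2(n, w)
    return (angle * x / n, angle * y / n, angle * z / n)

def q_exp(r):
    """Unit quaternion of a rotation vector (axis * angle, in radians)."""
    angle = math.sqrt(sum(c * c for c in r))
    if angle < 1e-12:
        return (1.0, 0.0, 0.0, 0.0)
    s = math.sin(angle / 2.0) / angle
    return (math.cos(angle / 2.0), s * r[0], s * r[1], s * r[2])

def karcher_mean(quats, iters=50, tol=1e-10):
    """Iteratively average the hypotheses in the tangent space of the mean."""
    mean = quats[0]
    for _ in range(iters):
        tangents = [q_log(q_mul(q_conj(mean), q)) for q in quats]
        step = tuple(sum(t[i] for t in tangents) / len(quats) for i in range(3))
        mean = q_mul(mean, q_exp(step))
        if math.sqrt(sum(c * c for c in step)) < tol:
            break
    return mean

def hypothesis_std_deg(quats):
    """RMS geodesic angle (degrees) of the hypotheses around their Karcher mean."""
    mean = karcher_mean(quats)
    sq = [sum(c * c for c in q_log(q_mul(q_conj(mean), q))) for q in quats]
    return math.degrees(math.sqrt(sum(sq) / len(quats)))

def is_reliable(quats, max_std_deg):
    """Accept a pose only if its hypotheses agree (hypothetical threshold)."""
    return hypothesis_std_deg(quats) < max_std_deg

def rot_z(deg):
    """Helper for the example: rotation about the z-axis by `deg` degrees."""
    half = math.radians(deg) / 2.0
    return (math.cos(half), 0.0, 0.0, math.sin(half))

spread = hypothesis_std_deg([rot_z(-10), rot_z(0), rot_z(10)])
```

Under these assumptions, hypotheses that scatter widely yield a large spread and would be rejected by `is_reliable`, which mirrors the trade-off between accuracy and rejected estimates reported in Fig. 6 (top).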

5 Conclusion

We propose a new approach for pose estimation that implicitly models ambiguities. Our method performs pose regression in quaternion space, formulating a quaternion-specific distance measure as the loss function. We show that our approach models rotational ambiguity without requiring any input pre-processing, and we demonstrate the feasibility of domain adaptation between synthetic and real data. We can estimate the axis of rotational ambiguity and perform pose refinement based on clustering without knowing the number of clusters in advance. Lastly, we argue that our method constitutes a metric of uncertainty for the 6D pose.

Our experiments show that our method is suitable for detecting challenging objects that exhibit multiple rotational symmetries as well as datasets with little ambiguity.

In conclusion, we believe that the new formulation of the pose detection problem from images as an ambiguous task paves the way towards interesting applications in the domain of robotic interactions and automation. Moreover, it opens the space for discussions about proper evaluation metrics that reflect the ambiguity in this domain.

Acknowledgments We would like to thank Toyota Motor Corporation for funding and supporting this work.

Explaining the Ambiguity of Object Detection and 6D Pose from Visual Data

Supplementary Material

Figure 1: Top: 3D models of the ’Unambiguous’ dataset from ’Hinterstoisser’ [15]: (a) ’ape’, (b) ’bvise’, (c) ’cam’, (d) ’can’, (e) ’driller’, (f) ’duck’, (g) ’holep’, (h) ’iron’, (i) ’phone’. Bottom: 3D models of the ’Ambiguous’ datasets from ’Hinterstoisser’ [15] (first two) and T-LESS [18] (latter four): (a) ’eggb’, (b) ’glue’, (c) ’obj_04’, (d) ’obj_05’, (e) ’obj_09’, (f) ’obj_10’.

1 Datasets

Fig. 1 shows all objects employed in our experiments. The upper row illustrates the objects of the ’Unambiguous’ dataset, taken from ’Hinterstoisser’ [15]. These objects do not exhibit any views that might induce ambiguities. In contrast, the lower row depicts the objects of the ’Ambiguous’ dataset. While the first two objects also belong to the ’Hinterstoisser’ dataset, the latter four come from the T-LESS dataset [18]. All of these objects can induce ambiguities for certain viewpoints. For instance, ’obj_04’ is a symmetric screw that nevertheless possesses distinct textures on its head. Consequently, only the views from the bottom (which do not show the texture) are ambiguous. In contrast, for each viewpoint of ’obj_09’ and ’obj_10’ there always exists one identical viewpoint on the other side. Thus, these objects are never ambiguity-free.

2 Synthetic Training Samples

Figure 2: Example from the utilized training datasets. Left: ’T-LESS’ - Right: ’Hinterstoisser’

We emphasize that we use no real data during training. Instead, we generate only synthetic samples by rendering objects with random poses onto images from the MS COCO dataset [29]. Using OpenGL commands, we generate a random pose from a valid range: 360° for azimuth and altitude along a view sphere, and 180° for in-plane rotation. We also vary the radius of the viewing sphere to enable multi-scale detection. To increase the variance of the dataset, we add random perturbations such as illumination and contrast changes, among others. This approach is similar to [23, 41]. In contrast to them, however, for each assigned anchor box we save exactly one 4D quaternion as the ground truth for the rotation, even if the view is ambiguous.
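
The pose sampling described above can be sketched as follows. This is only an illustration of the stated ranges, not our rendering code: the function name, the choice to center the in-plane range at zero, and the radius bounds are assumptions, and the OpenGL rendering and image augmentations are omitted.

```python
import math
import random

def sample_view(rng, radius_range=(0.4, 1.2)):
    """Draw one random viewpoint on a view sphere.

    radius_range is a hypothetical pair of bounds; the text only states
    that the radius is varied to enable multi-scale detection.
    """
    azimuth = rng.uniform(0.0, 360.0)    # 360 degrees of azimuth
    altitude = rng.uniform(0.0, 360.0)   # 360 degrees of altitude
    inplane = rng.uniform(-90.0, 90.0)   # 180 degrees of in-plane rotation (centering assumed)
    radius = rng.uniform(*radius_range)

    az, alt = math.radians(azimuth), math.radians(altitude)
    # Camera position on the sphere, looking at the object in the origin.
    position = (
        radius * math.cos(alt) * math.cos(az),
        radius * math.cos(alt) * math.sin(az),
        radius * math.sin(alt),
    )
    return {
        "azimuth_deg": azimuth,
        "altitude_deg": altitude,
        "inplane_deg": inplane,
        "radius": radius,
        "position": position,
    }

view = sample_view(random.Random(0))
```

Each sampled viewpoint would then be converted to a rotation, rendered onto a background image, and stored with a single ground-truth quaternion.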

3 Robust Ambiguity Detection and Estimation

| | ape | bvise | cam | can | cat | driller | duck | holep | iron | phone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ambiguity Detection Accuracy [%] | 92.8 | 97.8 | 96.6 | 95.1 | 94.6 | 90.4 | 90.3 | 81.1 | 99.7 | 98.7 |

| | ’eggb’ | ’glue’ | ’obj_04’ | ’obj_05’ | ’obj_09’ | ’obj_10’ |
| --- | --- | --- | --- | --- | --- | --- |
| Ambiguity Detection Accuracy [%] | 50.5 | 86.6 | 94.4 | 75.5 | 72.3 | 94.8 |
| Mean Symmetry Axis Deviation [°] | 8.23 | 16.4 | 23.2 | 13.07 | 13.3 | 22.5 |

Table 1: Top: Individual ambiguity detection accuracies for the ’Unambiguous’ dataset. Bottom: Individual ambiguity detection accuracies and mean axis deviations for the ’Ambiguous’ dataset.
Figure 3: Qualitative samples for ambiguity detection and ambiguity axis estimation. Thereby the green line illustrates the computed axis and the red axis depicts the ground truth axis.

Tab.1 shows our detailed ambiguity detection results for the ’Unambiguous’ (top) and ’Ambiguous’ (bottom) objects, respectively. In addition, we also report our individual results for the ambiguity axis estimation. We compute the mean deviation from the labeled ground truth. As a threshold for we empirically find to offer good accuracy. Fig. 3 demonstrates more qualitative results for ambiguity detection and the computation of the corresponding ambiguity axis.
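
As an illustration of the mean axis deviation reported in Tab. 1, the following sketch measures the angle between an estimated and a labeled axis. It assumes that an axis is an undirected line, so that a vector and its negation count as the same axis; this sign invariance and the function names are assumptions, not a statement of the exact evaluation code.

```python
import math

def axis_deviation_deg(estimated, ground_truth):
    """Angle in degrees between two axes, ignoring their sign."""
    dot = sum(e * g for e, g in zip(estimated, ground_truth))
    norm = math.sqrt(sum(e * e for e in estimated)) * math.sqrt(
        sum(g * g for g in ground_truth)
    )
    cosine = min(1.0, abs(dot) / norm)  # clamp against rounding errors
    return math.degrees(math.acos(cosine))

def mean_axis_deviation_deg(pairs):
    """Average deviation over (estimated, ground_truth) axis pairs."""
    return sum(axis_deviation_deg(e, g) for e, g in pairs) / len(pairs)

example = mean_axis_deviation_deg([
    ((0.0, 0.0, 1.0), (0.0, 0.0, -1.0)),  # same line, opposite sign
    ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),   # perpendicular axes
])
```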

3.1 Employing Multiple Hypothesis as Measurement For Reliability

We present further qualitative samples showing that the hypotheses can be employed as a measure of confidence. To this end, for each object of the ’Unambiguous’ dataset we show the poses with the lowest and the highest standard deviation in the hypotheses.

Figure 4: Qualitative examples referring to each object of the ’Unambiguous’ dataset. We show the pose with the lowest (left) and the highest (right) standard deviation in the hypotheses. Thereby, the blue bounding box depicts the ground truth pose, the red bounding box the predicted pose and the red frustums illustrate the hypotheses.

4 2D Object Detection and 6D Pose Estimation

In this section, we present our detailed results for 6D pose estimation and 2D detection. As in the paper, for the ’Unambiguous’ dataset we present our numbers with and for the ’Ambiguous’ dataset we set .

4.1 Unambiguous Object Detection and Pose Estimation

| Method | ape | bvise | cam | can | cat | driller | duck | holep | iron | phone | mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Our method | 87.6 | 100.0 | 86.3 | 99.7 | 98.1 | 97.6 | 93.5 | 89.0 | 99.6 | 95.4 | 94.7 |
| SSD-6D [23] | 76.3 | 97.1 | 92.2 | 93.1 | 89.3 | 97.8 | 80.0 | 71.6 | 98.2 | 92.4 | 88.8 |
| Kehl [24] | 98.1 | 94.8 | 93.4 | 82.6 | 98.1 | 96.5 | 97.9 | 97.9 | 91.0 | 84.9 | 93.5 |
| LineMOD [16] | 53.3 | 84.6 | 64.0 | 51.2 | 65.6 | 69.1 | 58.0 | 51.6 | 68.3 | 56.3 | 62.2 |

Table 2: F1-scores for each sequence of [16]. Note that the ’Hinterstoisser’ scores are supplied from [45] with their evaluation since [16] does not provide them.
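| | ape | bvise | cam | can | cat | driller | duck | holep | iron | phone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Rotational Error [°] | 22.0 | 11.3 | 14.8 | 13.5 | 13.6 | 18.9 | 24.7 | 33.7 | 9.2 | 13.5 |
| Translation Error [mm] | 38.0 | 26.9 | 54.2 | 27.8 | 38.5 | 42.6 | 52.1 | 48.9 | 30.3 | 42.0 |
| ADD [%] | 11.9 | 66.2 | 22.4 | 59.8 | 26.9 | 44.6 | 8.3 | 17.8 | 60.8 | 34.4 |
| VSS [%] | 73.9 | 81.8 | 77.6 | 81.9 | 73.7 | 75.6 | 75.3 | 78.2 | 83.8 | 79.7 |

Table 3: Individual 6D pose scores for each sequence of [15].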

For 2D object detection, we obtain slightly better results than [24], although Kehl et al. additionally leverage depth data. We also report a significant increase in detection performance compared to [23]. In general, we detect all objects very robustly. Even for the small ’ape’ object with little texture, we achieve an F1-score of 87.6%, which is an improvement of more than 10 percentage points over [23]. In Table 3, we present the individual pose estimation metrics. Additionally, below we show one qualitative sample for each object.
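
For reference, the F1-score used for the detection results is the harmonic mean of precision and recall. The sketch below computes it from detection counts; how a detection is counted as correct is not restated in this section, so the counts are treated as given, and the function name is hypothetical.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall from detection counts."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2.0 * precision * recall / (precision + recall)

score = f1_score(true_positives=8, false_positives=2, false_negatives=2)
```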

Figure 5: Qualitative results for the unambiguous objects: (a) input image, (b) 2D detections, (c) 6D pose and associated hypotheses.

4.2 Ambiguous Object Detection and Pose Estimation

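| Object | ’eggb’ | ’glue’ | obj_04 | obj_04 | obj_05 | obj_05 | obj_05 | obj_09 | obj_09 | obj_10 | obj_10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scene | | | 5 | 9 | 2 | 3 | 9 | 5 | 11 | 5 | 11 |
| ADI [%] | 54.0 | 54.1 | 16.3 | 14.0 | 83.3 | 68.9 | 76.7 | 72.0 | 61.6 | 39.8 | 38.2 |
| VSS [%] | 83.5 | 74.3 | 75.8 | 81.1 | 90.6 | 86.3 | 89.4 | 81.6 | 80.9 | 82.4 | 83.2 |

Table 4: Detailed evaluation scores for the ’Ambiguous’ dataset.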

Since comparing against a single ground truth pose is not suitable in a multiple hypothesis scenario, only metrics that do not rely on one unique pose are apt for this case. We thus chose the Visual Surface Similarity [23, 31] and the Average Distance of Indistinguishable points [19] as metrics for pose. We always take the detection with the highest confidence. We present our individual scores for the ’Ambiguous’ dataset in Tab. 4. Additionally, below we show one qualitative sample for each object.
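
To make the choice of metric concrete, the following sketch contrasts the average distance between corresponding model points (ADD) with the Average Distance of Indistinguishable points (ADI), which matches each point to its nearest neighbor instead. It follows the commonly used definition of these errors; the pose representation as a (rotation matrix, translation) pair, the brute-force nearest-neighbor search, and the function names are assumptions for illustration. The threshold that turns the error into the reported percentages is not restated here and is therefore left out.

```python
import math

def transform(points, rotation, translation):
    """Apply a rigid transform (3x3 nested list, 3-vector) to 3D points."""
    return [
        tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        for p in points
    ]

def add_error(points, pose_gt, pose_est):
    """Mean distance between corresponding points under both poses."""
    gt = transform(points, *pose_gt)
    est = transform(points, *pose_est)
    return sum(math.dist(a, b) for a, b in zip(gt, est)) / len(points)

def adi_error(points, pose_gt, pose_est):
    """Mean distance from each ground-truth point to its closest estimated point."""
    gt = transform(points, *pose_gt)
    est = transform(points, *pose_est)
    return sum(min(math.dist(a, b) for b in est) for a in gt) / len(points)

# A toy model that looks identical after a half turn about the z-axis.
model = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
identity = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], (0.0, 0.0, 0.0))
half_turn = ([[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]], (0.0, 0.0, 0.0))

add_value = add_error(model, identity, half_turn)
adi_value = adi_error(model, identity, half_turn)
```

In this toy case the half turn is visually indistinguishable from the ground truth, so the ADI error vanishes while the ADD error stays large, which is why a correspondence-based metric is unsuitable for ambiguous objects.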

Figure 6: Qualitative results for the ambiguous objects: (a) input image, (b) 2D detections, (c) 6D pose and associated hypotheses.


References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16), pages 265–284, 2016.
  • [2] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. CVPR, pages 95–104, 2017.
  • [3] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother. Learning 6D Object Pose Estimation using 3D Object Coordinates. In ECCV, 2014.
  • [4] B. Busam, T. Birdal, and N. Navab. Camera pose filtering with local regression geodesics on the riemannian manifold of dual quaternions. In ICCV Workshop, pages 2436–2445, 2017.
  • [5] H. Cai, T. Werner, and J. Matas. Fast detection of multiple textureless 3-D objects. In ICVS, 2013.
  • [6] M. Cicconet, V. Birodkar, M. Lund, M. Werman, and D. Geiger. A convolutional approach to reflection symmetry. Pattern Recognition Letters, 95(1):44–50, 2017.
  • [7] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:603–619, 2002.
  • [8] T. Do, M. Cai, T. Pham, and I. D. Reid. Deep-6dpose: Recovering 6d object pose from a single RGB image. CoRR, abs/1802.10367, 2018.
  • [9] F. Eaton and Z. Ghahramani. Choosing a Variable to Clamp: Approximate Inference Using Conditioned Belief Propagation. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, pages 145–152, 2009.
  • [10] M. Elawady, C. Ducottet, O. Alata, C. Barat, and P. Colantoni. Wavelet-based reflection symmetry detection via textural and color histograms. ICCV Workshop, pages 1734–1738, 2017.
  • [11] R. Hartley, J. Trumpf, Y. Dai, and H. Li. Rotation averaging. International Journal of Computer Vision, 103(3):267–305, 2013.
  • [12] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. In ICCV, pages 2980–2988, 2017.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.
  • [14] S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, N. Navab, P. Fua, and V. Lepetit. Gradient Response Maps for Real-Time Detection of Textureless Objects. TPAMI, 2012.
  • [15] S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab, and V. Lepetit. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In ICCV, 2011.
  • [16] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In ACCV, pages 548–562, Berlin, Heidelberg, 2013. Springer-Verlag.
  • [17] S. Hinterstoisser, V. Lepetit, P. Wohlhart, and K. Konolige. On pre-trained image features and synthetic images for deep learning. CoRR, abs/1710.10710, 2017.
  • [18] T. Hodan, P. Haluza, Š. Obdrzalek, J. Matas, M. Lourakis, and X. Zabulis. T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects. Proceedings - 2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017, pages 880–888, 2017.
  • [19] T. Hodan, J. Matas, and S. Obdrzalek. On Evaluation of 6D Object Pose Estimation. In ECCV Workshop, 2016.
  • [20] T. Hodan, X. Zabulis, M. Lourakis, S. Obdrzalek, and J. Matas. Detection and Fine 3D Pose Estimation of Textureless Objects in RGB-D Images. In IROS, 2015.
  • [21] H. Karcher. Riemannian center of mass and mollifier smoothing. Communications on pure and applied mathematics, 30(5):509–541, 1977.
  • [22] W. Ke, J. Chen, J. Jiao, G. Zhao, and Q. Ye. Srn: Side-output residual network for object symmetry detection in the wild. In CVPR, pages 302–310. IEEE, 2017.
  • [23] W. Kehl, F. Manhardt, S. Ilic, F. Tombari, and N. Navab. SSD-6D: Making RGB-Based 3D Detection and 6D Pose Estimation Great Again. In ICCV, 2017.
  • [24] W. Kehl, F. Milletari, F. Tombari, S. Ilic, and N. Navab. Deep Learning of Local RGB-D Patches for 3D Object Detection and 6D Pose Estimation. In ECCV, 2016.
  • [25] W. Kehl, F. Tombari, N. Navab, S. Ilic, and V. Lepetit. Hashmod: A Hashing Method for Scalable 3D Object Detection. In BMVC, 2015.
  • [26] D. P. Kingma and J. L. Ba. Adam: a Method for Stochastic Optimization. International Conference on Learning Representations 2015, pages 1–15, 2015.
  • [27] I. Kokkinos. Ubernet: Training a ‘universal’ convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, 2017.
  • [28] G. Lin, C. Shen, Q. Shi, A. V. D. Hengel, and D. Suter. Fast Supervised Hashing with Decision Trees for High-Dimensional Data. In CVPR, 2014.
  • [29] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • [30] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-y. Fu, and A. C. Berg. SSD : Single Shot MultiBox Detector. In ECCV, 2016.
  • [31] F. Manhardt, W. Kehl, N. Navab, and F. Tombari. Deep model-based 6d pose refinement in rgb. In ECCV, September 2018.
  • [32] M. Ovsjanikov, J. Sun, and L. Guibas. Global intrinsic symmetries of shapes. Eurographics Symposium on Geometry Processing, 27(5):1341–1348, 2008.
  • [33] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In ICRA, pages 3406–3413, May 2016.
  • [34] B. Planche, Z. Wu, K. Ma, S. Sun, S. Kluckner, T. Chen, A. Hutter, S. Zakharov, H. Kosch, and J. Ernst. Depthsynth: Real-time realistic synthetic data generation from cad models for 2.5 d recognition. IEEE International Conference on 3DVision, 2017.
  • [35] M. Rad and V. Lepetit. Bb8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In ICCV, pages 3848–3856, 2017.
  • [36] M. Rad, M. Oberweger, and V. Lepetit. Feature Mapping for Learning Fast and Accurate 3D Pose Inference from Synthetic Images. 2018.
  • [37] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  • [38] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
  • [39] C. Rupprecht, I. Laina, R. DiPietro, and M. Baust. Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In ICCV, pages 3611–3620, 2017.
  • [40] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR, abs/1409.1, 2014.
  • [41] M. Sundermeyer, Z.-C. Marton, M. Durner, M. Brucker, and R. Triebel. Implicit 3d orientation learning for 6d object detection from rgb images. In ECCV, September 2018.
  • [42] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In ICLR Workshop, 2016.
  • [43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions. In CVPR, 2015.
  • [44] K. Tateno, F. Tombari, I. Laina, and N. Navab. Cnn-slam: Real-time dense monocular slam with learned depth prediction. In CVPR, volume 2, 2017.
  • [45] A. Tejani, D. Tang, R. Kouskouridas, and T.-k. Kim. Latent-class hough forests for 3D object detection and pose estimation. In ECCV, 2014.
  • [46] B. Tekin, S. N. Sinha, and P. Fua. Real-time seamless single shot 6d object pose prediction. In CVPR, June 2018.
  • [47] E. Weiszfeld. Sur le point pour lequel la somme des distances de n points donnés est minimum. Tohoku Mathematical Journal, First Series, 43:355–386, 1937.
  • [48] P. Wohlhart and V. Lepetit. Learning Descriptors for Object Recognition and 3D Pose Estimation. In CVPR, 2015.
  • [49] J. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In CVPR, pages 702–709. IEEE Computer Society, 2012.
  • [50] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. In RSS, 2018.