1 Introduction
Driven by deep learning, image-based object detection has recently made a tremendous leap forward in both accuracy and efficiency
[38, 13, 30, 37]. An emerging research direction in this field is the estimation of the object’s pose in 3D space over the full 6 Degrees-of-Freedom (DoF), rather than on the 2D image plane
[23, 46, 8]. This is motivated by a strong interest in achieving robust and accurate monocular 6D pose estimation for applications in the field of robotic grasping, scene understanding and augmented/mixed reality, where the use of a 3D sensor is not feasible
[33, 27, 49, 44]. Nevertheless, 6D pose estimation from RGB is a challenging problem due to the intrinsic ambiguity caused by the visual appearance of objects under different viewpoints and occlusion. Indeed, most common objects exhibit shape ambiguities and repetitive patterns that cause their appearance to be very similar under different viewpoints, thus rendering pose estimation a problem with multiple correct solutions. Furthermore, occlusion (from the same object or from others) can also cause pose ambiguity.
For example, as illustrated in Figure 1, the cup is identical from every viewpoint along the continuous grey arc. Thus, from a single image, it is impossible to univocally estimate the current object pose. Moreover, object symmetry can also induce visual ambiguities, leading to multiple poses with the same visual appearance. However, most datasets do not reflect this ambiguity, since ground truth pose annotations are mostly uniquely defined at each frame. This is problematic for a proper optimization of the rotation, since a visually correct pose can still result in a high loss. Therefore, many recent 3D detectors avoid directly regressing the rotation and, instead, explicitly model the solution space in an unambiguous fashion [35, 23].
Essentially, in [23], the authors train their convolutional neural network (CNN) by mapping all possible pose solutions for a certain viewpoint onto an unambiguous arc on the view sphere. Rad et al. [35] employ a separate CNN, solely trained to classify the symmetry, in order to resolve these ambiguities. However, this simplification exhibits several downsides, such as the explicit inclusion of information about certain symmetries for each trained object. Moreover, this is not always easy to model, e.g. in the case of partial view ambiguity. Further, all these approaches rely on prior knowledge and annotation of the object symmetries, and aim to resolve the ambiguity by providing a single outcome in terms of estimated pose and object. In addition, these methods are unable to deal with ambiguities generated by other common factors such as occlusion. On the contrary, Sundermeyer et al. [41] recently proposed a novel method to conduct pose estimation in an ambiguity-free manner. To this end, they first utilize SSD for localization. Subsequently, they employ an augmented autoencoder on each 2D detection and conduct a nearest-neighbor (NN) lookup in the feature space to retrieve the detection’s rotation. Hence, they can implicitly deal with ambiguities, since views with identical visual appearance are mapped to the same location in the latent space. However, although the method is able to deal with ambiguities implicitly, it does not model their detection and description explicitly.
In this paper, we propose to model the ambiguity of the object detection and pose estimation tasks directly by allowing our learned model to predict multiple solutions, or hypotheses, for a given object’s visual appearance. Inspired by Rupprecht et al. [39], we propose a novel architecture and loss function for monocular 6D pose estimation by means of multiple predictions. Essentially, each predicted hypothesis consists of a 6D pose and an object class. When the visual appearance is ambiguous, the predicted hypotheses form point estimates of the distribution in 3D pose space. Conversely, when the object’s appearance is unique, the hypotheses collapse into the same solution. Importantly, our model is capable of learning the distribution of these 6D hypotheses from one single ground truth pose per sample, without further supervision.
Besides providing more insight and a better explanation of the current task, the additionally gained knowledge can be exploited to improve the accuracy of the pose estimation. In essence, analyzing the distribution of the hypotheses enables us to classify whether the currently perceived viewpoint is ambiguous and to compute the axis of ambiguity for the specific object and viewpoint. Subsequently, when ambiguity is detected, we can employ mean shift [7] clustering over the hypotheses in quaternion space to find the main modes for the current pose. A robust averaging in 3D rotation space for each mode then yields a highly accurate pose estimate. When the view is ambiguity-free, we can improve our pose estimates by robustly averaging over all 6D hypotheses and by taking advantage of the predicted pose distribution as a confidence measure.
2 Related Work
We first review recent work in object detection and pose estimation for 2D and 3D data. Afterwards, we discuss common grounds and main differences with approaches aimed at symmetry detection for 3D shapes.
Object Detection and Pose Estimation.
Methods for object detection and pose estimation can be subdivided into two categories, i.e. RGB-D and RGB based methods. While the former have long dominated this area, the latter are currently gaining interest.
Hinterstoisser et al. [16] proposed a novel image representation for template matching using image gradient orientations to robustly differentiate between different viewpoints. By using hashing to retrieve candidate templates, [5, 25, 20] achieve a sublinear matching complexity with respect to the number of objects.
Further recent research focuses on learning-based methods. For instance, [3, 45] employ random forests to detect objects in the scene and estimate the associated 6D pose. More recently, deep learning based methods have been proposed. For example, [48, 24] employ CNNs to learn an embedding space for the pose and class from RGB-D data, which can subsequently be utilized for retrieval. Notably, the majority of the most recent deep learning based methods focus on RGB as input [23, 35, 8, 46, 50, 41]. Since utilizing pre-trained networks often accelerates convergence and leads to better local minima, these methods are usually grounded on state-of-the-art backbones for 2D object detection, such as Inception [43] or ResNet [13]. In particular, Kehl et al. [23] employ SSD [30] with an InceptionV4 [42] backbone and extend it to also classify the viewpoint and in-plane rotation. Similarly, Sundermeyer et al. [41] also use SSD for localization, but employ an augmented autoencoder for the unambiguous retrieval of the associated 6D pose. Rad et al. [35] utilize VGG [40] and augment it to provide the 2D projections of the 3D bounding box corners. A similar approach is chosen by [46], based on YOLO [37]. Afterwards, both apply PnP to fit the associated 3D bounding box to the regressed 2D projections, in order to estimate the 3D pose of the detection. In [50], Xiang et al. compute a shared feature embedding for subsequent object instance segmentation paired with pose estimation. Finally, Do et al. [8] extend Mask R-CNN [12] by a third pose branch, which additionally provides the 3D rotation and the distance to the camera for each prediction. Afterwards, they compute the center of the 2D bounding box and back-project it to 3D, using the regressed depth value.
Object Symmetry Detection.
Oftentimes, object pose ambiguity arises from symmetric shapes. We review relevant methods that aim at extracting symmetry from 3D models to outline commonalities and differences with our approach.
Most methods for symmetry detection are found in the shape analysis community. Among the different kinds of symmetries, axial symmetries are of particular interest, and multiple approaches have been proposed, mostly relying on feature matching or spectral analysis. [9] treat the problem as a correspondence matching task between a series of keypoints on an object, determining the reflection symmetry hyperplane as an optimization problem. Elawady et al. [10] rely on edge features, extracted using a Log-Gabor filter at different scales and orientations, coupled with a voting procedure on the computed histogram of local texture and color information. In addition, [6] and [32] are also grounded on wavelet-based approaches. Recently, neural network approaches have also been proposed: Ke et al. [22] adapt an edge-detection architecture with multiple residual units and successfully apply it to symmetry detection on real-world images. Notably, all these approaches aim at detecting symmetries of 3D shapes alone, while our focus is to model the ambiguity arising from objects under specific viewpoints, with the goal of improving and explaining pose estimation.
3 Methodology
In this section, we describe in detail our method for handling symmetries and other ambiguities in object detection and pose estimation. We first define what we understand as an ambiguity.
3.1 Ambiguity in Object Detection and Pose Estimation
We describe the rigid body transformations $SE(3)$ via the semidirect product of $SO(3)$ and $\mathbb{R}^3$. While for the latter we use Euclidean 3-vectors, the algebra $\mathbb{H}_1$ of unit quaternions is used to model the spatial rotations in $SO(3)$. A quaternion $q$ is given by

$q = q_1 + q_2 i + q_3 j + q_4 k$ (1)

with $q_1, q_2, q_3, q_4 \in \mathbb{R}$, $\|q\| = 1$ and $i^2 = j^2 = k^2 = ijk = -1$. To this end, we regress quaternions above the hyperplane $q_1 \ge 0$ and thus omit the southern hemisphere, such that any possible 3D rotation can be expressed by precisely one single quaternion.
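As a concrete illustration, the normalization and hemisphere constraint can be sketched as follows (a minimal sketch, assuming a scalar-first $(q_1, q_2, q_3, q_4)$ layout):

```python
import numpy as np

def canonical_quaternion(q):
    """Normalize a 4-vector to a unit quaternion and map it to the
    upper hemisphere (non-negative scalar part), so that each 3D
    rotation has exactly one representative (q and -q encode the
    same rotation)."""
    q = np.asarray(q, dtype=float)
    q = q / np.linalg.norm(q)
    if q[0] < 0:          # scalar-first convention: q = (q1, q2, q3, q4)
        q = -q
    return q
```

For example, `canonical_quaternion([-1, 0, 0, 0])` is mapped to `[1, 0, 0, 0]`, the unique representative of the identity rotation.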
In essence, a direct naive regression of the rotation as a quaternion will lead to poor results, as the network will learn to predict a rotation that is closest to all rotations that are equivalent under symmetries. This prediction can be seen as the (conditional) mean rotation. More formally, in a typical supervised setting we associate images $I_i$ with poses $p_i$ in a dataset $D = \{(I_i, p_i)\}_{i=1}^{N}$, where $p_i \in SE(3)$. To describe symmetries, we define, for a given image $I$, the set $S(I)$ of poses that all produce an identical image
$S(I) = \{\, p \in SE(3) \mid R(p) = I \,\},$ (2)

where $R(p)$ denotes the image of the object observed under pose $p$.
Note that in the case of non-discrete symmetries the set $S(I)$ will contain infinitely many poses, which in turn transforms the sums over $S(I)$ in the following to integrals. For the sake of a simpler notation, and given a finite training set in practice, we choose to continue with the notion of a finite $S(I)$. The naive model $f_\theta$, which directly regresses a pose from $I$, optimizes a loss $\mathcal{L}$ by minimizing
$\sum_{i=1}^{N} \mathcal{L}\big(f_\theta(I_i), p_i\big)$ (3)
over the training set. However, due to the symmetries, the mapping from $I$ to $p$ is not well defined and cannot be modeled as a function. By minimizing Equation 3, $f_\theta$ learns to predict a pose that approximates all possible poses for this image equally well:
$f_\theta(I) \approx \operatorname{arg\,min}_{p} \sum_{\bar{p} \in S(I)} \mathcal{L}(p, \bar{p}).$ (4)
This is an unfavorable result, since $f_\theta(I)$ is chosen to minimize the sum of all losses towards the different symmetric poses. In the following section, we describe how we model these ambiguities inside our method.
3.1.1 Multiple Pose Hypotheses
The key idea behind the proposed method is to model the ambiguity by allowing multiple pose predictions from the network. In order to predict $M$ pose hypotheses from $I$, we extend the notation to $f_\theta(I) = \big(f_\theta^{(1)}(I), \ldots, f_\theta^{(M)}(I)\big)$, where $f_\theta$ now returns $M$ pose hypotheses for each image $I$.
For training, the idea is not to punish all hypotheses given the current pose annotation, since they might be correct under ambiguities. Thus, we use a loss that optimizes only one of the hypotheses for each annotation; the most intuitive choice is to pick the closest one. We adapt the meta loss $\mathcal{M}$ from [39], which operates on the tuple of all $M$ hypotheses,
$\mathcal{M}\big(f_\theta(I), p\big) = \sum_{j=1}^{M} \delta\big(j = \operatorname{arg\,min}_{k} \mathcal{L}(f_\theta^{(k)}(I), p)\big)\, \mathcal{L}\big(f_\theta^{(j)}(I), p\big),$ (5)
while we use the original pose loss $\mathcal{L}$ (Equation 11) for each hypothesis $f_\theta^{(j)}$. Equivalently, the meta loss selects the hypothesis closest to the annotation:
$\mathcal{M}\big(f_\theta(I), p\big) = \min_{1 \le j \le M} \mathcal{L}\big(f_\theta^{(j)}(I), p\big).$ (6)
However, the hard selection of the minimum in Equation 6 does not work in practice, as some of the hypothesis functions might never be updated if they are initialized far from the target values. Thus, we relax $\mathcal{M}$ to $\hat{\mathcal{M}}$ by adding the average error over all hypotheses with an epsilon weight $\varepsilon$:
$\hat{\mathcal{M}}\big(f_\theta(I), p\big) = \Big(1 - \frac{\varepsilon M}{M-1}\Big) \min_{1 \le j \le M} \mathcal{L}\big(f_\theta^{(j)}(I), p\big) + \frac{\varepsilon}{M-1} \sum_{j=1}^{M} \mathcal{L}\big(f_\theta^{(j)}(I), p\big).$ (7)
The normalization constants before the two components are designed to give a weight of $1-\varepsilon$ to the original formulation and to distribute a weight of $\varepsilon$ over the gradients of all other hypotheses. When $\varepsilon = 0$, then $\hat{\mathcal{M}} = \mathcal{M}$. The correction of the first coefficient is necessary since the average in the second term already contains the minimum from the first term.
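The relaxed meta loss above can be sketched as follows; the per-hypothesis pose losses are assumed to be precomputed scalars:

```python
import numpy as np

def relaxed_meta_loss(hyp_losses, eps):
    """Relaxed multi-hypothesis meta loss.

    hyp_losses: per-hypothesis pose losses L(f^(j)(I), p).
    eps:        relaxation weight; eps = 0 recovers the hard minimum.
    The winning hypothesis receives a total weight of (1 - eps),
    while the remaining M - 1 hypotheses share a weight of eps."""
    L = np.asarray(hyp_losses, dtype=float)
    M = L.size
    if M == 1 or eps == 0.0:
        return L.min()
    a = 1.0 - eps * M / (M - 1)   # coefficient of the minimum term
    b = eps / (M - 1)             # coefficient of the sum term
    return a * L.min() + b * L.sum()
```

For instance, with losses `[1.0, 2.0, 3.0]` and `eps = 0.3`, the best hypothesis keeps a total weight of 0.7 and the other two receive 0.15 each.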
3.2 Architecture
We employ SSD300 [30] with an extended InceptionV4 [42] backbone and adjust it to also provide the 6D pose along with each detection. In particular, we append two more ‘Reduction-B’ blocks to the backbone. Essentially, we branch off after each dimensionality reduction block and place anchor boxes at each of these resolutions to cover objects at different scales. Moreover, to include the unambiguous regression of the 6D pose, we modify the prediction kernel such that it provides, for each anchor box, $C$ class scores, the 2D anchor box refinement, and $M \cdot P$ pose parameters. Thereby, $C$ denotes the number of classes, $M$ denotes the number of hypotheses, and $P$ denotes the number of parameters to describe the 6D pose. In our case, $P = 5$: for each of the $M$ predicted hypotheses, we regress a 4D quaternion for the 3D rotation and the object’s distance towards the camera. We can estimate the remaining two degrees-of-freedom by back-projecting the center of the 2D bounding box using the inferred depth.
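For illustration, splitting the raw per-anchor output into its components might look as follows; the exact output layout (class scores first, then the 4-value box refinement, then the hypotheses) is an assumption of this sketch:

```python
import numpy as np

def split_predictions(raw, C, M, P=5):
    """Split a raw per-anchor prediction vector into class scores,
    the 2D box refinement (4 values), and M pose hypotheses of P
    parameters each (4D quaternion + depth). The ordering of the
    components is a hypothetical layout for illustration."""
    assert raw.shape[-1] == C + 4 + M * P
    cls = raw[..., :C]
    box = raw[..., C:C + 4]
    poses = raw[..., C + 4:].reshape(*raw.shape[:-1], M, P)
    return cls, box, poses
```

A network head emitting `C + 4 + M * P` channels per anchor can then be decoded hypothesis by hypothesis.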
Finally, given a set of positive boxes Pos and hard-mined negative boxes Neg for a training image, we minimize the following energy function:

$E(\mathrm{Pos}, \mathrm{Neg}) = \sum_{b \in \mathrm{Neg}} \mathcal{L}_{class}(b) + \sum_{b \in \mathrm{Pos}} \big( \mathcal{L}_{class}(b) + \alpha\, \mathcal{L}_{fit}(b) + \beta\, \hat{\mathcal{M}}(b) \big).$ (8)
For the class and the refinement of the anchor boxes, we employ the cross-entropy loss $\mathcal{L}_{class}$ and the smooth L1-norm $\mathcal{L}_{fit}$, respectively. In order to compare the similarity of two quaternions, we compute the angle between the estimated rotation $q$ and the ground truth rotation $\bar{q}$ according to
$\mathcal{L}_{rot}(q, \bar{q}) = \arccos\big( 2 \langle q, \bar{q} \rangle^2 - 1 \big).$ (9)
Additionally, we employ the smooth L1-norm as loss for the depth component $d$:

$\mathcal{L}_{depth}(d, \bar{d}) = \mathrm{smooth}_{L1}\big(d - \bar{d}\big).$ (10)
Altogether, we define the final loss $\mathcal{L}$ for each hypothesis $f_\theta^{(j)}$ and input image as follows:

$\mathcal{L}\big(f_\theta^{(j)}(I), p\big) = \mathcal{L}_{rot}\big(q^{(j)}, \bar{q}\big) + \mathcal{L}_{depth}\big(d^{(j)}, \bar{d}\big).$ (11)
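A sketch of this per-hypothesis loss in plain NumPy; the equal weighting of the rotation and depth terms is an assumption of this sketch:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (Huber-style) norm used for the depth error."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def pose_loss(q_est, d_est, q_gt, d_gt):
    """Per-hypothesis 6D pose loss: the angle between the two unit
    quaternions, arccos(2<q, q_gt>^2 - 1), which is invariant to the
    q / -q sign ambiguity, plus the smooth-L1 depth error."""
    dot = float(np.dot(q_est, q_gt))
    # clip for numerical safety before arccos
    l_rot = np.arccos(np.clip(2.0 * dot * dot - 1.0, -1.0, 1.0))
    return l_rot + smooth_l1(d_est - d_gt)
```

For two identical poses the loss is exactly zero; a 90° rotation offset alone contributes pi/2 to the loss.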
3.3 Processing Multiple Hypotheses
During inference, we further analyze the predicted multiple hypotheses in order to determine whether the pose of the object is ambiguous. Additionally, in case we detect an ambiguity, we can also exploit the multiple hypotheses to estimate the view-dependent axes of ambiguity.
Detection of Visual Ambiguities in Scenes.
We analyze the distribution of the predicted hypotheses in quaternion space to determine whether the pose exhibits an ambiguity. To this end, Principal Component Analysis (PCA) is performed on the quaternion hypotheses $\{q^{(j)}\}_{j=1}^{M}$: the singular value decomposition of the centered data matrix indicates the ambiguity. In particular, a large dominant singular value $\sigma_1$ makes an ambiguity in the pose prediction likely, while small singular values imply a collapse to one single unambiguous solution. We determine the existence and type of ambiguity by thresholding $\sigma_1$. Empirically, we find this criterion to offer good estimations for ambiguity, and the ratio $\sigma_2 / \sigma_1$ of the two largest singular values to reveal the ambiguity type in a robust way: depending on this ratio, the ambiguity either covers the entire axis or is constrained to an arc. It is noteworthy that we can learn to detect ambiguities without further supervision, directly from standard datasets.
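A minimal sketch of this ambiguity test; the threshold on $\sigma_1$ is a hypothetical value, not the one used in the experiments:

```python
import numpy as np

def ambiguity_singular_values(quats):
    """Singular values of the centered M x 4 matrix of quaternion
    hypotheses. A large leading singular value sigma_1 indicates an
    ambiguous view; sigma_2 / sigma_1 hints at the ambiguity type."""
    Q = np.asarray(quats, dtype=float)
    Q = Q - Q.mean(axis=0)            # PCA: center before the SVD
    return np.linalg.svd(Q, compute_uv=False)

def is_ambiguous(quats, thresh=0.1):  # the threshold is an assumption
    """True if the hypotheses spread out instead of collapsing."""
    return ambiguity_singular_values(quats)[0] > thresh
```

Hypotheses that collapse to a single solution yield singular values near zero, whereas hypotheses spread along an arc produce a large leading singular value.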
Estimation of the Axis of Ambiguity.
As mentioned, the most prominent sources of visual ambiguity are symmetries in the objects of interest, as illustrated in Fig. 3 (left) and (mid). Nevertheless, for other objects such as cups, (self-)occlusion can also induce ambiguities in appearance, as demonstrated in Fig. 3 (right).
In order to calculate a viewpoint-dependent ambiguity axis, we take a closer look at the scenario. A rotation $R_{ij}$ rotates the camera $c_i$ to $c_j$ around the rotation axis $a_{ij}$:

$R(a_{ij}, \theta_{ij})\, c_i = c_j.$ (12)
All these rotation axes $a_{ij}$ lie in the same plane, which is perpendicular to the ambiguity axis $s$. Thus, if we stack the rotation axes $a_{ij}$ into a matrix $A$, we can formulate the overdetermined linear equation system $A s = 0$. The ambiguity axis can be found as the solution to the optimization problem

$\hat{s} = \operatorname{arg\,min}_{\|s\|_2 = 1} \|A s\|_1,$ (13)

which we solve for the L1-norm to be more robust against outliers than with the L2-norm.
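The L1 problem above can be approximated by iteratively reweighted least squares, where each step reduces to the smallest right singular vector of a reweighted axis matrix. The iteration count and damping constant are hypothetical choices:

```python
import numpy as np

def ambiguity_axis(A, iters=20, delta=1e-6):
    """Approximate s* = argmin_{||s||=1} ||A s||_1 via iteratively
    reweighted least squares. A is the (K x 3) stack of rotation
    axes a_ij; each weighted L2 step is solved by the smallest right
    singular vector of diag(sqrt(w)) A."""
    A = np.asarray(A, dtype=float)
    s = np.linalg.svd(A)[2][-1]          # plain L2 initialization
    w = np.ones(len(A))
    for _ in range(iters):
        Aw = A * np.sqrt(w)[:, None]
        s = np.linalg.svd(Aw)[2][-1]     # smallest right singular vector
        w = 1.0 / np.maximum(np.abs(A @ s), delta)  # L1 reweighting
    return s
```

For axes lying in the xy-plane, the estimate converges to the z-axis (up to sign), i.e. the direction perpendicular to all stacked axes.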
3.4 From Multiple Hypotheses to 6D Pose
After analyzing the distribution of the hypotheses, we can robustly compute the associated 6D pose for each case.
Unambiguous Object Pose.
In case of an unambiguous object pose, we utilize the multiple hypotheses as input for a geometric median (geodesic mean [11]) to robustify the overall estimation:

$\hat{q} = \operatorname{arg\,min}_{q \in \mathbb{H}_1} \sum_{j=1}^{M} d\big(q, q^{(j)}\big).$ (14)

The iterative calculation follows the Weiszfeld algorithm [47] in the tangent spaces to the quaternion hypersphere [4]. From a statistical perspective, our rotation measures $q^{(j)}$ are treated as inputs to an estimator that robustly detects the geometric median, where $d(\cdot, \cdot)$ gives the geodesic distance on the quaternion hypersphere. In addition, we compute the median depth of all hypotheses. Afterwards, we utilize the center of the 2D detection and back-project it into 3D to obtain the translation, and therewith the full 6D pose of the detection.
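A simplified sketch of this robust rotation averaging; for brevity it runs Weiszfeld iterations with the chordal distance in $\mathbb{R}^4$ and renormalizes, rather than the full tangent-space formulation:

```python
import numpy as np

def quaternion_median(quats, iters=50, eps=1e-8):
    """Geometric median of unit quaternions via a Weiszfeld-style
    iteration. This sketch uses the chordal distance in R^4 and
    renormalizes after every update; hypotheses are first
    sign-aligned, since q and -q represent the same rotation."""
    Q = np.array([q / np.linalg.norm(q) for q in quats], dtype=float)
    Q[Q @ Q[0] < 0] *= -1.0              # resolve the sign ambiguity
    m = Q.mean(axis=0)
    m /= np.linalg.norm(m)
    for _ in range(iters):
        # inverse-distance weights; eps guards against division by zero
        w = 1.0 / np.maximum(np.linalg.norm(Q - m, axis=1), eps)
        m = (w[:, None] * Q).sum(axis=0) / w.sum()
        m /= np.linalg.norm(m)
    return m
```

Unlike the plain mean, the median is barely perturbed by a single outlier hypothesis, which is the reason for preferring it here.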
Ambiguous Object Pose.
As the number of possible 3D rotations is finite yet unknown, we employ mean shift [7] to cluster the hypotheses in quaternion space. To this end, the angular distance of the quaternion vectors measures their similarity. This yields either one cluster (if the poses are connected) or multiple clusters (if they are unconnected), as illustrated in Fig. 3. For each cluster, we compute a median rotation and the median depth to retrieve the associated 3D translation. Note that we only consider the depths of the hypotheses that contributed to the corresponding cluster. We conduct simple contour checks in RGB in order to find the best fitting cluster, from which we extract the final 6D pose.
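A minimal mean-shift sketch over quaternions with an angular-distance kernel; the bandwidth and merge tolerance are hypothetical settings:

```python
import numpy as np

def mean_shift_quats(quats, bandwidth=0.3, iters=30, merge_tol=0.1):
    """Cluster unit quaternions by mean shift, using the angular
    distance of the quaternion 4-vectors as similarity. Returns the
    list of distinct cluster centers."""
    Q = np.array([q / np.linalg.norm(q) for q in quats], dtype=float)
    Q[Q @ Q[0] < 0] *= -1.0                      # q and -q are identical
    modes = Q.copy()
    for _ in range(iters):
        for i, m in enumerate(modes):
            ang = np.arccos(np.clip(np.abs(Q @ m), -1.0, 1.0))
            w = np.exp(-(ang / bandwidth) ** 2)  # Gaussian kernel
            m_new = (w[:, None] * Q).sum(axis=0)
            modes[i] = m_new / np.linalg.norm(m_new)
    centers = []
    for m in modes:                              # merge converged modes
        if all(np.arccos(min(abs(np.dot(m, c)), 1.0)) > merge_tol
               for c in centers):
            centers.append(m)
    return centers
```

Since the number of modes is discovered by the merging step, the number of clusters never has to be specified in advance.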
4 Evaluation
In this section, we first introduce our experimental setup. Afterwards, we demonstrate the robustness of distinguishing whether a view exhibits an ambiguity. Third, we report our 6D pose estimation accuracy for the unambiguous and the ambiguous case. Finally, we demonstrate how we can model uncertainty in pose estimation by analyzing the variance across hypotheses.
We build two datasets, one for each case. In particular, for the unambiguous dataset, we use the popular ‘Hinterstoisser’ dataset [14]. However, we moved the ‘glue’ and ‘eggbox’ objects to the ambiguous dataset, since both exhibit several views (mostly from the top) which are not unique. Additionally, following [23, 35], we removed the ‘cup’ and ‘bowl’ objects, because no watertight CAD models are provided for them. We also discard the ‘lamp’, since its CAD model does not possess correct normal vectors for proper rendering. To the ambiguous dataset, besides the ‘glue’ and ‘eggbox’ objects, we added several models from T-LESS [18] to cover different types of ambiguities. In essence, T-LESS mostly consists of symmetric and textureless industrial objects. For our experiments, we choose a subset that covers both cases: complete rotational symmetry along an axis (object 4) and objects with more than one rotational symmetry (objects 5, 9 and 10).
4.1 Experimental Setup
As noted in [17], training on synthetically generated data samples instead of real-world images simplifies the collection of training data, however at the cost of performance due to the domain gap. Nevertheless, we render CAD models in random poses within a defined range of rotations, adding a series of augmentations, such as illumination changes, shadows and blur, as well as background images taken from the MS COCO dataset [28]. As already discussed, our method models the presence of rotational symmetries; thus, it is not necessary to manually constrain the valid range of rotations per object to avoid ambiguities, which is a common workaround in recent pose estimation systems [23, 35].
We implemented our method in TensorFlow [1] v1.5 and conducted all our experiments on an Intel i7-5820K@3.3GHz CPU with an NVIDIA GTX 1080 GPU. For all experiments, we train with a batch size of 10 and use ADAM [26] with a constant learning rate.
Implementation Details.
There are several hyperparameters that influence the performance of our method. In particular, the relaxation weight $\varepsilon$ influences the convergence to a reference ground truth. If the value is too small, only the hypothesis closest to the ground truth will be updated, causing poses to cluster around that particular point, whereas a large value will result in loose hypothesis predictions. To overcome this issue, at the beginning we distribute comparably more loss, such that all hypotheses are able to learn appropriate results, before we force them to either specialize or gather. To this end, we linearly interpolate $\varepsilon$ from a larger initial value to a smaller final value during training.

The choice of $\alpha$ and $\beta$ in (8) is also influential. In our experiments, we find that a fixed value for $\alpha$ works well in general. In contrast, $\beta$ has to be chosen more carefully. Similar to $\varepsilon$, a linearly increasing value for $\beta$ works better in practice. Essentially, a large value of $\beta$ hinders the network from properly focusing on learning to robustly detect the object. Nevertheless, the loss for the rotation tends to converge quickly when choosing $\beta$ too small. Consequently, we set $\beta$ to a small value at the beginning and linearly increase it during training.
Lastly, the number of hypotheses $M$ also influences the accuracy of the method. A small number of hypotheses might not be able to convey inputs of high ambiguity and hinders clustering, whereas a large value results in redundant information. We validate all experiments with different values of $M$ to understand its influence.
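The linear schedules for $\varepsilon$ and $\beta$ described above can be sketched generically (the endpoint values used in our experiments are not reproduced here):

```python
def linear_schedule(start, end, step, total_steps):
    """Linearly interpolate a hyperparameter (e.g. the relaxation
    weight epsilon or the pose-loss weight beta) over training.
    Values outside [0, total_steps] are clamped to the endpoints;
    the endpoint values themselves are hypothetical."""
    t = min(max(step / float(total_steps), 0.0), 1.0)
    return (1.0 - t) * start + t * end
```

Calling this once per training step with the current step index yields the annealed weight for that step.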
Evaluation metrics.
In order to properly assess the 6D pose performance, we distinguish again between ambiguous and non-ambiguous objects. When dealing with non-ambiguous objects, we report the absolute error for the 3D rotation in degrees and the 3D translation in millimeters. We also report our accuracy using the Average Distance of Distinguishable Model Points (ADD) metric from [16], which measures whether the average deviation of the transformed model points is less than 10% of the object’s diameter.
We also report our results for the Visual Surface Similarity (VSS) metric. As in [23, 31], we define VSS similar to the Visual Surface Discrepancy (VSD) [19], however without its depth tolerance. Hence, we measure the pixel-wise overlap of the rendered ground truth pose and the rendered prediction. We decide to employ this metric since it can be applied to both ambiguous and ambiguity-free objects.
4.2 Shape Ambiguity Analysis
For each image of the ambiguous dataset, we manually annotated whether the current object view exhibits ambiguity based on the visible object texture and shape. This ground truth is used to quantitatively assess the capability of our method to detect pose ambiguity. Additionally, we compute the ground truth symmetry axes for each object. It is important to note that we do not conduct object symmetry detection, instead, we describe the perceived pose ambiguity in terms of a symmetry axis. These annotations are only used for evaluation and not during training.
In Fig. 5 (top), we first show how robustly we can detect ambiguities. Additionally, for each detected ambiguity, we compute the average discrepancy, in degrees, between the computed symmetry axis and the annotated ground truth axis. In essence, for the ambiguity-free case, we can report an accuracy of more than 90%, while for the ambiguous case we can also state a high accuracy of approximately 83% correctly classified views. Furthermore, the mean axis only deviates by 16°, which shows that our formulation is able to precisely explain the perceived ambiguity.
On the bottom, we show one sample each of estimated ambiguity axes from ‘Hinterstoisser’ and ‘T-LESS’. For each detection, we draw the estimated axis in red, while the green line denotes the hand-annotated ground-truth axis.
4.3 6D Pose Estimation and 2D Object Detection
We evaluate the accuracy of our method in case of ambiguous and unambiguous pose estimation. Furthermore, Fig. 4 shows one qualitative example for each case. While all hypotheses (red frustums) collapse to one pose for the unambiguous views, for the ambiguous views they spread around the calculated axis of ambiguity (red line).
Unambiguous Pose Estimation.
In Tab. 1, we report our results for the unambiguous subset. In particular, even for the single hypothesis case, our approach outperforms SSD6D by approximately 35% of relative error, while also being more robust in terms of 2D detection. Comparing with Sundermeyer et al. [41], we can report a relative improvement of more than 50% with respect to the ADD metric. In addition, our robust averaging over all hypotheses leads to a further improvement in all metrics. Interestingly, for the unambiguous dataset it turns out that a moderate number of hypotheses works best in terms of 6D pose and also 2D detection. Employing too many hypotheses can hinder convergence, since some hypotheses tend to starve depending on the initialization, and when too few hypotheses are used, the model cannot benefit from the averaging in 6D.
On the contrary, [35] and [46] achieve a slightly higher ADD score. This can mostly be credited to the fact that they train on real data in order to close the domain gap between real and synthetic data. More precisely, they use a subset of the test data and crop the objects of interest. Thereafter, they place them on top of a random background image from MS COCO. In contrast to these approaches, [23], [41] and we all train on purely synthetic data, which allows training for any object, provided the CAD model. Generally, synthetic training is desirable, since it saves the time and money needed for annotating data; however, it commonly also leads to a decrease in performance. Therefore, a lot of research is currently focusing on further closing this domain gap [2, 17, 34, 36], to which this work contributes.
Method | Rot. [°] | Trans. [mm] | VSS [%] | ADD [%] | F1
Ours | 17.9 | 45.6 | 76.8 | 31.2 | 91.6
Ours | 18.9 | 44.3 | 76.3 | 32.8 | 92.1
Ours | 17.5 | 40.6 | 78.2 | 35.3 | 94.7
Ours | 19.2 | 45.6 | 77.2 | 31.3 | 90.6
Ours | 18.7 | 44.6 | 77.4 | 33.8 | 92.7
Ours | 19.2 | 44.9 | 77.3 | 32.7 | 91.0
[41] | – | – | – | 22.1 | –
SSD6D [23] | 28.0 | 72.4 | 67.4 | 9.4 | 88.8
BB8 [35] | – | – | – | 45.9 | –
Tekin [46] | – | – | – | 47.9 | –
Ambiguous Pose Estimation.
Referring to Tab. 1, for the ambiguous ‘Hinterstoisser’ objects, we can report a VSS score of 79% and an ADI score of 54%, which corresponds to relative improvements of approximately 13% and 145% over SSD6D, respectively. Surprisingly, the multiple hypotheses detector overall achieves similar performance as the single hypothesis predictor for the 6D pose. However, for the 2D detection case, we are able to increase the accuracy from 79% to 94%. As stated above, only a few views are not well defined for these objects. While investigating the results, we discovered that the single hypothesis predictor is not capable of understanding exactly these ambiguous views and tends to simply discard them. In contrast, the multiple hypotheses predictor is indeed able to understand these views and gives reliable pose predictions.
For the ambiguous T-LESS objects (Tab. 4), our multiple hypotheses approach surpasses the single hypothesis estimator for all objects. In essence, the single hypothesis estimator, trained and evaluated under the same conditions, is not able to capture the ambiguities in pose and, thus, cannot produce equally accurate results. Only when the ambiguities could be resolved was the single hypothesis predictor able to compute precise poses. Sundermeyer et al. [41] are able to exceed our model by 4% and 3%, referring to VSS and ADI respectively. Since the colors of the CAD models differ widely from the real scenes, generalization to these objects is very difficult from RGB information only. However, Sundermeyer et al. make use of real data for training their 2D detector on T-LESS, which also has a direct impact on their 6D pose estimates: first, they infer depth from the aspect ratio of the 2D bounding boxes; second, deducing the rotation in a two-stage fashion simplifies the overall problem, yet comes with a linear cost in runtime. In conclusion, we can state overall good performance while relying on synthetic training data only, whereas [41] require real data to enable training on T-LESS.
Object | Scenes | VSS [%] Ours | VSS [%] M = 1 | VSS [%] [41] | ADI [%] Ours | ADI [%] M = 1 | ADI [%] [41]
obj_04 | 5, 9 | 72.2 | 68.6 | 78.5 | 17.8 | 14.1 | 15.2
obj_05 | 2, 3, 4 | 84.6 | 82.8 | 88.8 | 65.3 | 48.3 | 76.3
obj_09 | 5, 11 | 81.8 | 79.1 | 86.5 | 66.8 | 54.5 | 77.3
obj_10 | 5, 11 | 82.8 | 78.5 | 82.3 | 39.0 | 29.4 | 31.9
mean | | 80.2 | 77.3 | 84.0 | 47.2 | 36.6 | 50.6
Object | VSS [%] Ours | VSS [%] M = 1 | VSS [%] [23] | ADI [%] Ours | ADI [%] M = 1 | ADI [%] [23] | F1 Ours | F1 M = 1 | F1 [23]
eggbox | 83.5 | 78.5 | 76.3 | 54.1 | 56.0 | 26.3 | 98.0 | 83.0 | 93.7
glue | 74.4 | 74.0 | 65.1 | 54.0 | 58.7 | 17.6 | 90.3 | 74.0 | 76.8
mean | 78.9 | 76.3 | 70.7 | 54.1 | 57.4 | 22.0 | 94.2 | 78.5 | 85.5
4.4 Employing Multiple Hypotheses as a Measurement for Reliability
To the best of our knowledge, there is no prior work that models both the confidence in class and detection as well as the confidence in the continuous pose estimate. Yet, this knowledge can highly improve the overall robustness and accuracy. In our case, besides the additional information on the viewpoint ambiguity, we can utilize the different hypotheses as a measurement for the confidence in the unambiguous 6D pose. To quantify this effect, we report our test results on the unambiguous subset of ‘Hinterstoisser’ in Fig. 6 (top), where we compute a confidence measure via the standard deviation of the hypotheses with respect to the Karcher mean [21]. Naturally, a lower standard deviation means more accurate pose estimates. For instance, by only allowing poses with a standard deviation below a certain threshold, we can strongly improve each metric, while only losing approximately 10% of all pose estimates. Specifically, the rotational error decreases by approximately 20% (i.e. from 19.2° to 15.5°). Additionally, the translation error also decreases slightly, from 44.8mm to 43.0mm. Accordingly, employing an even lower threshold for the standard deviation gives another significant improvement for the pose (especially in rotation), however at the cost of rejecting more estimates. In addition, the qualitative example on the bottom also confirms these results. Essentially, while the pose of the ‘driller’ with the lowest standard deviation is very accurate, the pose possessing the highest standard deviation is rather imprecise. We experienced the same behavior for every object of the unambiguous dataset. Summarizing, one can ascertain from Fig. 6 that the standard deviation of the pose hypotheses can be utilized as a robust measurement for the reliability of the pose.
STD | Rot. [°] | Trans. [mm] | VSS [%] | ADD [%] | Rejects [%]
 | 11.8 | 39.4 | 80.0 | 37.7 | 32.6
 | 13.8 | 41.3 | 79.1 | 35.5 | 18.2
 | 15.5 | 43.0 | 78.3 | 34.3 | 10.5
 | 17.3 | 44.0 | 77.7 | 33.4 | 4.0
 | 19.2 | 44.8 | 77.3 | 32.7 | 0.0
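A sketch of this confidence filtering; a simple chordal mean stands in for the Karcher mean, and the rejection threshold is a hypothetical value:

```python
import numpy as np

def rotation_std(quats):
    """Dispersion of the pose hypotheses: root-mean-square angular
    distance (in radians) of the sign-aligned quaternions to their
    normalized mean, used as a confidence measure."""
    Q = np.array([q / np.linalg.norm(q) for q in quats], dtype=float)
    Q[Q @ Q[0] < 0] *= -1.0              # q and -q are identical
    m = Q.mean(axis=0)
    m /= np.linalg.norm(m)
    ang = np.arccos(np.clip(np.abs(Q @ m), -1.0, 1.0))
    return float(np.sqrt(np.mean(ang ** 2)))

def accept_pose(quats, max_std=0.2):     # the threshold is an assumption
    """Reject detections whose hypotheses disagree too much."""
    return rotation_std(quats) < max_std
```

Collapsed hypotheses give a standard deviation near zero and are accepted, whereas widely spread hypotheses are rejected as unreliable.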
5 Conclusion
We propose a new approach for pose estimation that implicitly models ambiguities. Our method performs pose regression in quaternion space, formulating a quaternion-specific distance measure as loss function. We show that our approach models rotational ambiguity without requiring any input preprocessing, as well as the feasibility of domain adaptation between synthetic and real data. We can estimate the axis of rotational ambiguity and perform pose refinement based on clustering, without requiring the number of clusters in advance. Lastly, we argue that our method provides a metric of uncertainty for the 6D pose.
Our experiments show that our method is suitable for detecting challenging objects that exhibit multiple rotational symmetries, as well as for datasets with little ambiguity.
In conclusion, we believe that the new formulation of the pose detection problem from images as an ambiguous task paves the way towards interesting applications in the domain of robotic interactions and automation. Moreover, it opens the space for discussions about proper evaluation metrics that reflect the ambiguity in this domain.
Acknowledgments We would like to thank Toyota Motor Corporation for funding and supporting this work.
Explaining the Ambiguity of Object Detection and 6D Pose from Visual Data
Supplementary Material
1 Datasets
Fig. 1 shows all the objects we employed for our experiments. The upper row illustrates all objects of the ’Unambiguous’ dataset, taken from ’Hinterstoisser’ [15]. These objects do not exhibit any views which might induce ambiguities. On the contrary, the lower row depicts all objects of the ’Ambiguous’ dataset. While the first two objects also belong to the ’Hinterstoisser’ dataset, the latter four come from the T-LESS dataset [18]. All these objects can induce ambiguities for certain viewpoints. For instance, ’obj 04’ is a symmetric screw, however possessing distinct textures on its head. Due to this, only the views from the bottom (which do not show the texture) are ambiguous. In contrast, for each viewpoint of ’obj 09’ and ’obj 10’, there always exists one identical viewpoint on the other side. Thus, these objects are never ambiguity-free.
2 Synthetic Training Samples
First, we want to clearly emphasize that we use no real data during training. Instead, we generate only synthetic samples by rendering objects with random poses onto images from the MS COCO dataset [29]. Using OpenGL, we generate a random pose from a valid range: 360° in azimuth and altitude along a view sphere, and 180° of in-plane rotation. We also vary the radius of the viewing sphere to enable multi-scale detection. To increase the variance of the dataset, we add random perturbations such as illumination and contrast changes, among others. This approach is similar to [23, 41]. However, in contrast to them, for each assigned anchor box we store exactly one 4D quaternion as the rotation ground truth, even if the view is ambiguous.
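The pose sampling described above can be sketched as follows. The angular ranges follow the text; the radius bounds are placeholder assumptions, since the paper does not state them here:

```python
import numpy as np

def sample_view_pose(r_min=0.4, r_max=2.0, rng=np.random):
    """Sample a random camera pose on the view sphere following
    the ranges stated in the text. r_min/r_max (in meters) are
    illustrative assumptions, not values from the paper."""
    return {
        "azimuth_deg": rng.uniform(0.0, 360.0),    # 360 deg azimuth
        "altitude_deg": rng.uniform(0.0, 360.0),   # 360 deg altitude
        "inplane_deg": rng.uniform(-90.0, 90.0),   # 180 deg in-plane range
        "radius": rng.uniform(r_min, r_max),       # multi-scale detection
    }
```

Each sampled pose would then be used to render the object onto a random COCO background before the photometric perturbations are applied.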
3 Robust Ambiguity Detection and Estimation
'Unambiguous' objects:

                                    ape   bvise  cam   can   cat   driller  duck  holep  iron  phone
Ambiguity Detection Accuracy [%]    92.8  97.8   96.6  95.1  94.6  90.4     90.3  81.1   99.7  98.7

'Ambiguous' objects:

                                    eggb  glue  obj 04  obj 05  obj 09  obj 10
Ambiguity Detection Accuracy [%]    50.5  86.6  94.4    75.5    72.3    94.8
Mean Symmetry Axis Deviation [°]    8.23  16.4  23.2    13.07   13.3    22.5
Tab. 1 shows our detailed ambiguity detection results for the 'Unambiguous' (top) and 'Ambiguous' (bottom) objects, respectively. In addition, we report our individual results for the ambiguity axis estimation as the mean deviation from the labeled ground truth. We empirically choose the detection threshold to offer good accuracy. Fig. 3 demonstrates more qualitative results for ambiguity detection and the computation of the corresponding ambiguity axis.
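One plausible way to detect ambiguity and estimate its axis from a set of pose hypotheses is sketched below: a view is flagged as ambiguous when hypotheses disagree by more than a threshold, and the shared rotation axis is recovered from the pairwise relative rotations. The 15° threshold and the averaging scheme are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def rotation_axis(R):
    """Axis of a rotation matrix, from its skew-symmetric part.
    Returns None for (near-)0 or 180 deg rotations, where the
    skew part vanishes."""
    axis = np.array([R[2, 1] - R[1, 2],
                     R[0, 2] - R[2, 0],
                     R[1, 0] - R[0, 1]])
    n = np.linalg.norm(axis)
    return axis / n if n > 1e-8 else None

def ambiguity_axis(rotations, angle_thresh_deg=15.0):
    """Hedged sketch: flag ambiguity if any hypothesis deviates from
    the first by more than angle_thresh_deg (an assumed value), and
    average the relative-rotation axes to estimate the symmetry axis."""
    R0 = rotations[0]
    axes, ambiguous = [], False
    for R in rotations[1:]:
        R_rel = R @ R0.T
        cos_a = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
        if np.degrees(np.arccos(cos_a)) > angle_thresh_deg:
            ambiguous = True
            a = rotation_axis(R_rel)
            if a is not None:
                axes.append(a)
    if not ambiguous or not axes:
        return False, None
    # align signs before averaging (an axis and its negation are equivalent)
    axes = [a if np.dot(a, axes[0]) >= 0 else -a for a in axes]
    mean_axis = np.mean(axes, axis=0)
    return True, mean_axis / np.linalg.norm(mean_axis)
```

The mean deviation reported in Tab. 1 would then be the angle between such an estimated axis and the labeled ground-truth axis.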
3.1 Employing Multiple Hypotheses as a Measure of Reliability
We present further qualitative samples showing that the hypotheses can be employed as a measure of confidence. To this end, for each object of the 'Unambiguous' dataset we show the poses with the lowest and the highest standard deviation over the hypotheses.
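The standard deviation over hypotheses can be computed in several ways; a minimal sketch, assuming the angular deviation of each quaternion hypothesis from the sign-aligned mean, is:

```python
import numpy as np

def hypothesis_spread(quats):
    """Angular standard deviation (in degrees) of unit-quaternion
    hypotheses around their normalized mean. Illustrative sketch of
    using hypothesis spread as a confidence measure; the paper's
    exact statistic may differ."""
    q = np.array([qi if np.dot(qi, quats[0]) >= 0 else -qi
                  for qi in quats], dtype=float)
    mean = q.mean(axis=0)
    mean /= np.linalg.norm(mean)
    # geodesic angle of each hypothesis to the mean rotation
    dots = np.clip(np.abs(q @ mean), -1.0, 1.0)
    angles = 2.0 * np.degrees(np.arccos(dots))
    return angles.std()
```

A small spread indicates that the network's hypotheses agree, i.e., a confident and likely unambiguous pose.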
4 2D Object Detection and 6D Pose Estimation
In this section, we present our detailed results for 6D pose estimation and 2D detection. As in the main paper, we use one hyperparameter setting for the 'Unambiguous' dataset and a different one for the 'Ambiguous' dataset.
4.1 Unambiguous Object Detection and Pose Estimation
               ape    bvise  cam   can   cat   driller  duck  holep  iron  phone  mean
RGB
Our method     87.6   100.0  86.3  99.7  98.1  97.6     93.5  89.0   99.6  95.4   94.7
SSD-6D [23]    76.3   97.1   92.2  93.1  89.3  97.8     80.0  71.6   98.2  92.4   88.8
RGB-D
Kehl [24]      98.1   94.8   93.4  82.6  98.1  96.5     97.9  97.9   91.0  84.9   93.5
LineMOD [16]   53.3   84.6   64.0  51.2  65.6  69.1     58.0  51.6   68.3  56.3   62.2
                        ape   bvise  cam   can   cat   driller  duck  holep  iron  phone
Rotational Error [°]    22.0  11.3   14.8  13.5  13.6  18.9     24.7  33.7   9.2   13.5
Translation Error [mm]  38.0  26.9   54.2  27.8  38.5  42.6     52.1  48.9   30.3  42.0
ADD [%]                 11.9  66.2   22.4  59.8  26.9  44.6     8.3   17.8   60.8  34.4
VSS [%]                 73.9  81.8   77.6  81.9  73.7  75.6     75.3  78.2   83.8  79.7
For 2D object detection, we obtain slightly better results than [24], although Kehl et al. additionally leverage depth data. We also report a significant increase in detection performance compared to [23]. We generally detect all objects very robustly: even for the small, weakly textured 'ape' object, we reach an F1-score of 87.6%, an improvement of more than 10% over [23]. In Tab. 3, we present the individual pose estimation metrics we employed. Additionally, below we show one qualitative sample for each object.
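For reference, the ADD metric reported in Tab. 3 follows Hinterstoisser et al. [16]: the mean distance between model points transformed by the ground-truth and the estimated pose. A minimal sketch:

```python
import numpy as np

def add_metric(model_pts, R_gt, t_gt, R_est, t_est):
    """ADD [16]: mean Euclidean distance between corresponding model
    points under the ground-truth and estimated poses. A pose is
    commonly accepted if ADD < 10% of the model diameter."""
    gt = model_pts @ R_gt.T + t_gt
    est = model_pts @ R_est.T + t_est
    return np.linalg.norm(gt - est, axis=1).mean()
```

Note that ADD compares corresponding points, so it penalizes visually identical poses of symmetric objects, which motivates the ambiguity-aware metrics used in the next section.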
4.2 Ambiguous Object Detection and Pose Estimation
Object     eggb   glue   obj_04        obj_05                obj_09        obj_10
Scene      –      –      5      9      2      3      9       5      11     5      11
ADI [%]    54.0   54.1   16.3   14.0   83.3   68.9   76.7    72.0   61.6   39.8   38.2
VSS [%]    83.5   74.3   75.8   81.1   90.6   86.3   89.4    81.6   80.9   82.4   83.2
Since comparing against a single ground truth is not suitable in a multiple-hypothesis scenario, only metrics that do not rely on a unique ground-truth pose are apt for this case. We therefore chose the Visual Surface Similarity (VSS) [23, 31] and the Average Distance of Indistinguishable points (ADI) [19] as pose metrics, always taking the detection with the highest confidence. We present our individual scores for the 'Ambiguous' dataset in Tab. 4. Additionally, below we show one qualitative sample for each object.
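The ADI metric [19] is the symmetric-aware counterpart of ADD: each ground-truth point is matched to its nearest estimated point, so indistinguishable poses are not penalized. A brute-force O(n²) sketch (real implementations typically use a k-d tree):

```python
import numpy as np

def adi_metric(model_pts, R_gt, t_gt, R_est, t_est):
    """ADI [19]: mean distance from each ground-truth-transformed
    model point to its nearest estimated-pose point, which scores
    symmetric poses as equivalent."""
    gt = model_pts @ R_gt.T + t_gt
    est = model_pts @ R_est.T + t_est
    # pairwise distances, then nearest-neighbor match per gt point
    d = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

For a symmetric object, a pose rotated by the symmetry (e.g., 180° about the symmetry axis) yields an ADI near zero while ADD would report a large error, which is exactly why ADI is used on the 'Ambiguous' dataset.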
References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16), pages 265–284, 2016.
[2] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, pages 95–104, 2017.
[3] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother. Learning 6D object pose estimation using 3D object coordinates. In ECCV, 2014.
[4] B. Busam, T. Birdal, and N. Navab. Camera pose filtering with local regression geodesics on the Riemannian manifold of dual quaternions. In ICCV Workshop, pages 2436–2445, 2017.
[5] H. Cai, T. Werner, and J. Matas. Fast detection of multiple textureless 3D objects. In ICVS, 2013.
[6] M. Cicconet, V. Birodkar, M. Lund, M. Werman, and D. Geiger. A convolutional approach to reflection symmetry. Pattern Recognition Letters, 95(1):44–50, 2017.
[7] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. TPAMI, 24:603–619, 2002.
[8] T. Do, M. Cai, T. Pham, and I. D. Reid. Deep-6DPose: Recovering 6D object pose from a single RGB image. CoRR, abs/1802.10367, 2018.
[9] F. Eaton and Z. Ghahramani. Choosing a variable to clamp: Approximate inference using conditioned belief propagation. In AISTATS, pages 145–152, 2009.
[10] M. Elawady, C. Ducottet, O. Alata, C. Barat, and P. Colantoni. Wavelet-based reflection symmetry detection via textural and color histograms. In ICCV Workshop, pages 1734–1738, 2017.
[11] R. Hartley, J. Trumpf, Y. Dai, and H. Li. Rotation averaging. International Journal of Computer Vision, 103(3):267–305, 2013.
[12] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. In ICCV, pages 2980–2988, 2017.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[14] S. Hinterstoisser, C. Cagniart, S. Ilic, P. Sturm, N. Navab, P. Fua, and V. Lepetit. Gradient response maps for real-time detection of texture-less objects. TPAMI, 2012.
[15] S. Hinterstoisser, S. Holzer, C. Cagniart, S. Ilic, K. Konolige, N. Navab, and V. Lepetit. Multimodal templates for real-time detection of texture-less objects in heavily cluttered scenes. In ICCV, 2011.
[16] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In ACCV, pages 548–562, 2013.
[17] S. Hinterstoisser, V. Lepetit, P. Wohlhart, and K. Konolige. On pre-trained image features and synthetic images for deep learning. CoRR, abs/1710.10710, 2017.
[18] T. Hodan, P. Haluza, Š. Obdržálek, J. Matas, M. Lourakis, and X. Zabulis. T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects. In WACV, pages 880–888, 2017.
[19] T. Hodan, J. Matas, and Š. Obdržálek. On evaluation of 6D object pose estimation. In ECCV Workshop, 2016.
[20] T. Hodan, X. Zabulis, M. Lourakis, Š. Obdržálek, and J. Matas. Detection and fine 3D pose estimation of texture-less objects in RGB-D images. In IROS, 2015.
[21] H. Karcher. Riemannian center of mass and mollifier smoothing. Communications on Pure and Applied Mathematics, 30(5):509–541, 1977.
[22] W. Ke, J. Chen, J. Jiao, G. Zhao, and Q. Ye. SRN: Side-output residual network for object symmetry detection in the wild. In CVPR, pages 302–310, 2017.
[23] W. Kehl, F. Manhardt, S. Ilic, F. Tombari, and N. Navab. SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again. In ICCV, 2017.
[24] W. Kehl, F. Milletari, F. Tombari, S. Ilic, and N. Navab. Deep learning of local RGB-D patches for 3D object detection and 6D pose estimation. In ECCV, 2016.
[25] W. Kehl, F. Tombari, N. Navab, S. Ilic, and V. Lepetit. Hashmod: A hashing method for scalable 3D object detection. In BMVC, 2015.
[26] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[27] I. Kokkinos. UberNet: Training a 'universal' convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In CVPR, 2017.
[28] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter. Fast supervised hashing with decision trees for high-dimensional data. In CVPR, 2014.
[29] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[30] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[31] F. Manhardt, W. Kehl, N. Navab, and F. Tombari. Deep model-based 6D pose refinement in RGB. In ECCV, 2018.
[32] M. Ovsjanikov, J. Sun, and L. Guibas. Global intrinsic symmetries of shapes. Eurographics Symposium on Geometry Processing, 27(5):1341–1348, 2008.
[33] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In ICRA, pages 3406–3413, 2016.
[34] B. Planche, Z. Wu, K. Ma, S. Sun, S. Kluckner, T. Chen, A. Hutter, S. Zakharov, H. Kosch, and J. Ernst. DepthSynth: Real-time realistic synthetic data generation from CAD models for 2.5D recognition. In 3DV, 2017.
[35] M. Rad and V. Lepetit. BB8: A scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In ICCV, pages 3848–3856, 2017.
[36] M. Rad, M. Oberweger, and V. Lepetit. Feature mapping for learning fast and accurate 3D pose inference from synthetic images. In CVPR, 2018.
[37] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[38] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
[39] C. Rupprecht, I. Laina, R. DiPietro, and M. Baust. Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In ICCV, pages 3611–3620, 2017.
[40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1, 2014.
[41] M. Sundermeyer, Z.-C. Marton, M. Durner, M. Brucker, and R. Triebel. Implicit 3D orientation learning for 6D object detection from RGB images. In ECCV, 2018.
[42] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In ICLR Workshop, 2016.
[43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[44] K. Tateno, F. Tombari, I. Laina, and N. Navab. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In CVPR, 2017.
[45] A. Tejani, D. Tang, R. Kouskouridas, and T.-K. Kim. Latent-class Hough forests for 3D object detection and pose estimation. In ECCV, 2014.
[46] B. Tekin, S. N. Sinha, and P. Fua. Real-time seamless single shot 6D object pose prediction. In CVPR, 2018.
[47] E. Weiszfeld. Sur le point pour lequel la somme des distances de n points donnés est minimum. Tohoku Mathematical Journal, First Series, 43:355–386, 1937.
[48] P. Wohlhart and V. Lepetit. Learning descriptors for object recognition and 3D pose estimation. In CVPR, 2015.
[49] J. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In CVPR, pages 702–709, 2012.
[50] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. In RSS, 2018.