HybridPose: 6D Object Pose Estimation under Hybrid Representations

01/07/2020 ∙ by Chen Song, et al. ∙ The University of Texas at Austin

We introduce HybridPose, a novel 6D object pose estimation approach. HybridPose utilizes a hybrid intermediate representation to express different geometric information in the input image, including keypoints, edge vectors, and symmetry correspondences. Compared to a unitary representation, our hybrid representation allows pose regression to exploit more and diverse features when one type of predicted representation is inaccurate (e.g., because of occlusion). HybridPose leverages a robust regression module to filter out outliers in the predicted intermediate representation. We show the robustness of HybridPose by demonstrating that all intermediate representations can be predicted by the same simple neural network without sacrificing the overall performance. Compared to state-of-the-art pose estimation approaches, HybridPose is comparable in running time and is significantly more accurate. For example, on the Occlusion Linemod dataset, our method achieves a prediction speed of 30 fps with a mean ADD(-S) accuracy of 79.2%, representing a 67.4% improvement over the current state-of-the-art approach. Our implementation of HybridPose is available at https://github.com/chensong1995/HybridPose.




1 Introduction

Estimating the 6D pose of an object from an RGB image is a fundamental problem in 3D vision and has diverse applications in object recognition and robot-object interaction. Advances in deep learning have led to significant breakthroughs in this problem. While early works typically formulate pose estimation as end-to-end pose classification [38] or pose regression [15, 41], recent pose estimation methods usually leverage keypoints as an intermediate representation [37, 33], and align predicted 2D keypoints with ground-truth 3D keypoints for pose estimation. In addition to ground-truth pose labels, these methods incorporate keypoints as an intermediate supervision, facilitating smooth model training. Keypoint-based methods are built upon two assumptions: (1) a machine learning model can accurately predict 2D keypoint locations; and (2) these predictions provide sufficient constraints to regress the underlying 6D pose. Both assumptions easily break in many real-world settings. Due to object occlusions and representational limitations of the prediction network, it is often impossible to accurately predict 2D keypoint coordinates from an RGB image alone.

Figure 1: HybridPose predicts keypoints, edge vectors, and symmetry correspondences. In (a), we show the input RGB image, in which the object of interest (driller) is partially occluded. In (b), red markers denote predicted 2D keypoints. In (c), edge vectors are defined by a fully-connected graph among all keypoints. In (d), symmetry correspondences connect each 2D pixel on the object to its symmetric counterpart. For illustrative purposes, we only draw symmetry correspondences of 50 random samples from 5755 predicted object pixels in this example. The predicted pose (f) is obtained by jointly aligning the predictions with the 3D template, which involves solving a non-linear optimization problem.

In this paper, we introduce HybridPose, which leverages multiple intermediate representations to express the geometric information in the input image for pose estimation. In addition to keypoints, HybridPose integrates a prediction network that outputs edge vectors between adjacent keypoints. As most objects possess a (partial) reflection symmetry, HybridPose also utilizes predicted dense pixel-wise correspondences that reflect the underlying symmetric relations between pixels. Compared to a unitary representation, this hybrid representation enjoys a multitude of advantages. First, HybridPose integrates more signals in the input image: edge vectors include object skeleton structure, and symmetry correspondences incorporate interior details. Second, HybridPose offers far more constraints than using keypoints alone for pose regression, enabling accurate pose prediction even if a significant fraction of predicted elements are outliers (e.g., under occlusions). Finally, it can be shown that symmetry correspondences stabilize the rotation component of pose prediction, especially along the normal direction of the reflection plane.

Given the intermediate representation predicted by the first module, the second module of HybridPose performs pose regression. In particular, HybridPose employs trainable robust norms to prune outliers in predicted intermediate representation. We show how to combine pose initialization and pose refinement to maximize the quality of the resulting object pose. We also show how to train HybridPose effectively using a training set for the prediction module, and a validation set for the refinement module.

We evaluate HybridPose on two popular benchmark datasets, Linemod [11] and Occlusion Linemod [3]. In terms of accuracy (under the ADD(-S) metric), HybridPose leads to considerable improvements over all state-of-the-art methods that merely utilize keypoints. On Occlusion Linemod [3], HybridPose achieves an accuracy of 79.2%, which represents a 67.4% improvement over DPOD [43], the current state-of-the-art method on this benchmark dataset.

Despite the gain in accuracy, our approach is efficient and runs at 30 frames per second on a commodity workstation. Compared to approaches that utilize sophisticated network architectures to predict a single intermediate representation, HybridPose achieves considerably better performance by using a relatively simple network to predict hybrid representations.

2 Related Works

Figure 2: Approach overview. HybridPose consists of intermediate representation prediction networks and a pose regression module. The prediction networks take an image as input, and output predicted keypoints, edge vectors, and symmetry correspondences. The pose regression module consists of an initialization sub-module and a refinement sub-module. The initialization sub-module solves a linear system with predicted intermediate representations to obtain an initial pose. The refinement sub-module utilizes the GM robust norm and optimizes (9) to obtain the final pose prediction.

Intermediate representation for pose. To express the geometric information in an RGB image, a prevalent intermediate representation is keypoints, which achieves state-of-the-art performance [33, 31, 35]. The corresponding pose estimation pipeline combines keypoint prediction and pose regression initialized by the PnP algorithm [17]. Keypoint predictions are usually generated by a neural network, and previous works use different types of tensor descriptors to express 2D keypoint coordinates. A common approach represents keypoints as peaks of heatmaps [27, 46], which becomes sub-optimal when keypoints are occluded, as the input image does not provide explicit visual cues for their locations. Alternative keypoint representations include vector-fields [33] and patches [13]. These representations allow better keypoint predictions under occlusion, and eventually lead to improvement in pose estimation accuracy. However, keypoints alone are a sparse representation of the object pose, whose potential in improving estimation accuracy is limited.

Besides keypoints, another common intermediate representation is the coordinate of every image pixel in the 3D physical world, which provides dense 2D-3D correspondences for pose alignment, and is robust under occlusion [3, 4, 29, 19]. However, regressing dense object coordinates is much more costly than keypoint prediction. Dense coordinates are also less accurate than keypoints due to the lack of corresponding visual cues. In addition to keypoints and pixel-wise 2D-3D correspondences, depth is another alternative intermediate representation in visual odometry settings, which can be estimated together with pose in an unsupervised manner [45]. In practice, the accuracy of depth estimation is limited by the representational power of neural networks.

Unlike previous approaches, HybridPose combines multiple intermediate representations, and exhibits collaborative strength for pose estimation.

Multi-modal input. To address the challenges of pose estimation from a single RGB image, several works have considered inputs from multiple sensors. A popular approach is to leverage information from both RGB and depth images [45, 39, 41]. In the presence of depth information, pose regression can be reformulated as a 3D point alignment problem, which is then solved by the ICP algorithm [41]. Although HybridPose utilizes multiple intermediate representations, all intermediate representations are predicted from an RGB image alone. HybridPose handles situations in which depth information is absent.

Edge features. Edges are known to capture important image features such as object contours [2], salient edges [22], and straight line segments [44]. Unlike these low-level image features, HybridPose leverages semantic edge vectors defined between adjacent keypoints. This representation, which captures correlations between keypoints and reveals the underlying structure of the object, is concise and easy to predict. Such edge vectors offer more constraints than keypoints alone for pose regression and have clear advantages under occlusion. Our approach is similar to [5], which predicts directions between adjacent keypoints to link keypoints into a human skeleton. However, we predict both the direction and the magnitude of edge vectors, and use these vectors to estimate object poses.

Symmetry detection from images. Symmetry detection has received significant attention in computer vision. We refer readers to [21, 26] for general surveys, and [1, 40] for recent advances. Traditional applications of symmetry detection include face recognition [30], depth estimation [20], and 3D reconstruction [12, 42]. In the context of object pose estimation, people have studied symmetry from the perspective that it introduces ambiguities for pose estimation (cf. [24, 35, 41]), since symmetric objects with different poses can have the same appearance in an image. Several works [35, 41, 6, 24, 29] have explored how to address such ambiguities, e.g., by designing loss functions that are invariant under symmetric transformations.

Robust regression. Pose estimation via intermediate representation is sensitive to outliers in predictions, which are introduced by occlusion and cluttered backgrounds [36, 31, 39]. To mitigate pose error, several works assign different weights to different predicted elements in the 2D-3D alignment stage [33, 31]. In contrast, our approach additionally leverages robust norms to automatically filter outliers in the predicted elements.

Besides the reweighting strategy, some recent works propose to use deep learning-based refiners to boost the pose estimation performance [18, 25, 43]. [43, 18] use point matching loss and achieve high accuracy. [25] predicts pose updates using contour information. Unlike these works, our approach considers the critical points and the loss surface of the robust objective function, and does not involve a fixed pre-determined iteration count used in recurrent network based approaches.

3 Approach

The input to HybridPose is an image $I$ containing an object in a known class, taken by a pinhole camera with known intrinsic parameters. Assuming that the class of objects has a canonical coordinate system $\Sigma$ (i.e., the 3D point cloud), HybridPose outputs the 6D camera pose $(R_I \in SO(3), t_I \in \mathbb{R}^3)$ of the image object under $\Sigma$, where $R_I$ is the rotation component and $t_I$ is the translation component.

3.1 Approach Overview

As illustrated in Figure 2, HybridPose consists of a prediction module and a pose regression module.

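Prediction module (Section 3.2). HybridPose utilizes three prediction networks $f^{\mathcal{K}}_{\theta}$, $f^{\mathcal{E}}_{\phi}$, and $f^{\mathcal{S}}_{\gamma}$ to estimate a set of keypoints $\mathcal{K} = \{p_k\}$, a set of edges between keypoints $\mathcal{E} = \{v_e\}$, and a set of symmetry correspondences between image pixels $\mathcal{S} = \{(q_{s,1}, q_{s,2})\}$. $\mathcal{K}$, $\mathcal{E}$, and $\mathcal{S}$ are all represented in 2D. $\theta$, $\phi$, and $\gamma$ are trainable parameters.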

The keypoint network $f^{\mathcal{K}}_{\theta}$ employs an off-the-shelf prediction network [33]. The other two prediction networks, $f^{\mathcal{E}}_{\phi}$ and $f^{\mathcal{S}}_{\gamma}$, are introduced to stabilize pose regression when keypoint predictions are inaccurate. Specifically, $f^{\mathcal{E}}_{\phi}$ predicts edge vectors along a pre-defined graph of keypoints, which stabilizes pose regression when keypoints are cluttered in the input image. $f^{\mathcal{S}}_{\gamma}$ predicts symmetry correspondences that reflect the underlying (partial) reflection symmetry. A key advantage of this symmetry representation is that the number of symmetry correspondences is large: every image pixel on the object has a symmetry correspondence. As a result, even with a large outlier ratio, symmetry correspondences still provide sufficient constraints for estimating the plane of reflection symmetry for regularizing the underlying pose. Moreover, symmetry correspondences incorporate more features within the interior of the underlying object than keypoints and edge vectors.

Pose regression module (Section 3.3). The second module of HybridPose optimizes the object pose to fit the output of the three prediction networks. This module combines a trainable initialization sub-module and a trainable refinement sub-module. In particular, the initialization sub-module performs SVD to solve for an initial pose in the global affine pose space. The refinement sub-module utilizes robust norms to filter out outliers in the predicted elements for accurate object pose estimation.

Training HybridPose (Section 3.4). We train HybridPose by splitting the dataset into a training set and a validation set. We use the training set to learn the prediction module, and the validation set to learn the hyper-parameters of the pose regression module. We have tried training HybridPose end-to-end using one training set. However, the difference between the prediction distributions on the training set and testing set leads to sub-optimal generalization performance.

3.2 Hybrid Representation

This section describes three intermediate representations used in HybridPose.

Keypoints. The first intermediate representation consists of keypoints, which have been widely used for pose estimation. Given the input image $I$, we train a neural network $f^{\mathcal{K}}_{\theta}$ to predict 2D coordinates of a pre-defined set of $|\mathcal{K}|$ keypoints. In our experiments, HybridPose uses an off-the-shelf model called PVNet [33], which is the state-of-the-art keypoint-based pose estimator that employs a voting scheme to predict both visible and invisible keypoints.

Besides outliers in predicted keypoints, another limitation of keypoint-based techniques is that when the difference (direction and distance) between adjacent keypoints characterizes important information of the object pose, inexact keypoint predictions incur large pose error.

Edges. The second intermediate representation, which consists of edge vectors along a pre-defined graph, explicitly models the displacement between every pair of keypoints. As illustrated in Figure 2, HybridPose utilizes a simple network $f^{\mathcal{E}}_{\phi}$ to predict $|\mathcal{E}|$ edge vectors in the 2D image plane, where $|\mathcal{E}|$ denotes the number of edges in the pre-defined graph. In our experiments, $\mathcal{E}$ is a fully-connected graph, i.e., $|\mathcal{E}| = |\mathcal{K}|(|\mathcal{K}| - 1)/2$.
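As a minimal sketch of how a fully-connected edge set can be enumerated, the snippet below lists all unordered keypoint pairs and forms their 2D displacement vectors. The function names, the start-to-end orientation of each edge, and the choice of 8 random keypoints are illustrative assumptions, not details taken from the HybridPose implementation.

```python
from itertools import combinations

import numpy as np


def fully_connected_edges(num_keypoints):
    """Return all unordered keypoint index pairs (s, t) with s < t."""
    return list(combinations(range(num_keypoints), 2))


def edge_vectors(keypoints_2d, edges):
    """Displacement p_t - p_s for every edge (s, t); the orientation is an assumption."""
    kp = np.asarray(keypoints_2d, dtype=float)
    return np.stack([kp[t] - kp[s] for s, t in edges])


# Eight hypothetical 2D keypoints in pixel coordinates.
keypoints = np.random.default_rng(0).uniform(0, 640, size=(8, 2))
edges = fully_connected_edges(len(keypoints))
vectors = edge_vectors(keypoints, edges)
print(len(edges), vectors.shape)  # 28 (28, 2)
```

With 8 keypoints the fully-connected graph has 8 * 7 / 2 = 28 edges, so the edge representation grows quadratically with the number of keypoints.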

Symmetry correspondences. The third intermediate representation consists of predicted pixel-wise symmetry correspondences that reflect the underlying reflection symmetry. In our experiments, HybridPose extends the network architecture of FlowNet 2.0 [14] that combines a dense pixel-wise flow and the semantic mask predicted by PVNet. The resulting symmetry correspondences are given by predicted pixel-wise flow within the mask region. Compared to the first two representations, the number of symmetry correspondences is significantly larger, which provides rich constraints even for occluded objects. However, symmetry correspondences only constrain two degrees of freedom in the rotation component of the object pose (cf. [23]). It is necessary to combine symmetry correspondences with other intermediate representations.

Summary of network design. In our experiments, $f^{\mathcal{K}}_{\theta}$, $f^{\mathcal{E}}_{\phi}$, and $f^{\mathcal{S}}_{\gamma}$ are all based on ResNet [10], and the implementation details are discussed in Section 4.1. Trainable parameters are shared across all except the last convolutional layer. Therefore, the overhead of introducing the edge prediction network $f^{\mathcal{E}}_{\phi}$ and the symmetry prediction network $f^{\mathcal{S}}_{\gamma}$ is insignificant.

3.3 Pose Regression

The second module of HybridPose takes predicted intermediate representations $\{\mathcal{K}, \mathcal{E}, \mathcal{S}\}$ as input and outputs a 6D object pose $(R_I, t_I)$ for the input image $I$. Similar to state-of-the-art pose regression approaches [34], HybridPose combines an initialization sub-module and a refinement sub-module. Both sub-modules leverage all predicted elements. The refinement sub-module additionally leverages a robust function to model outliers in the predicted elements.

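In the following, we denote 3D keypoint coordinates in the canonical coordinate system as $\bar{p}_k$, $1 \le k \le |\mathcal{K}|$. To make notations uncluttered, we denote the output of the first module, i.e., predicted keypoints, edge vectors, and symmetry correspondences, as $p_k \in \mathbb{R}^2$, $1 \le k \le |\mathcal{K}|$; $v_e \in \mathbb{R}^2$, $1 \le e \le |\mathcal{E}|$; and $(q_{s,1} \in \mathbb{R}^2, q_{s,2} \in \mathbb{R}^2)$, $1 \le s \le |\mathcal{S}|$, respectively. Our formulation also uses the homogeneous coordinates $\hat{p}_k \in \mathbb{R}^3$, $\hat{v}_e \in \mathbb{R}^3$, $\hat{q}_{s,1} \in \mathbb{R}^3$, and $\hat{q}_{s,2} \in \mathbb{R}^3$ of $p_k$, $v_e$, $q_{s,1}$, and $q_{s,2}$, respectively. The homogeneous coordinates are normalized by the camera intrinsic matrix.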

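Initialization sub-module. This sub-module leverages constraints between $(R, t)$ and predicted elements and solves for $(R, t)$ in the affine space, which are then projected to $SE(3)$ in an alternating optimization manner. To this end, we introduce the following difference vectors for each type of predicted elements:

$$r^{\mathcal{K}}_{R,t}(p_k) := \hat{p}_k \times (R\bar{p}_k + t), \tag{1}$$
$$r^{\mathcal{E}}_{R,t}(v_e, p_{e_s}) := \hat{v}_e \times (R\bar{p}_{e_t} + t) + \hat{p}_{e_s} \times (R\bar{v}_e), \tag{2}$$
$$r^{\mathcal{S}}_{R,t}(q_{s,1}, q_{s,2}) := (\hat{q}_{s,1} \times \hat{q}_{s,2})^T R\bar{n}_r, \tag{3}$$

where $e_s$ and $e_t$ are end vertices of edge $e$, $\bar{v}_e = \bar{p}_{e_t} - \bar{p}_{e_s} \in \mathbb{R}^3$, and $\bar{n}_r \in \mathbb{R}^3$ is the normal of the reflection symmetry plane in the canonical system.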
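The following sketch checks numerically that cross-product style difference vectors for keypoints, edges, and symmetry correspondences vanish when the predictions are noise-free. It assumes the residual forms $\hat{p} \times (R\bar{p} + t)$ for keypoints, $\hat{v} \times (R\bar{p}_{e_t} + t) + \hat{p}_{e_s} \times R(\bar{p}_{e_t} - \bar{p}_{e_s})$ for edges, and $(\hat{q}_1 \times \hat{q}_2)^T R\bar{n}_r$ for symmetry pairs, with points lifted as $(x, y, 1)$ and displacements as $(dx, dy, 0)$. The pose, the 3D points, and all function names are hypothetical.

```python
import numpy as np


def hom_point(p):
    """Homogeneous coordinates (x, y, 1) of a normalized 2D point."""
    return np.append(p, 1.0)


def hom_vector(v):
    """Homogeneous coordinates (dx, dy, 0) of a 2D displacement (assumed convention)."""
    return np.append(v, 0.0)


def project(R, t, X):
    """Pinhole projection to normalized image coordinates (intrinsics already removed)."""
    Y = R @ X + t
    return Y[:2] / Y[2]


def residual_keypoint(R, t, p, p_bar):
    return np.cross(hom_point(p), R @ p_bar + t)


def residual_edge(R, t, v, p_s, p_bar_s, p_bar_t):
    return (np.cross(hom_vector(v), R @ p_bar_t + t)
            + np.cross(hom_point(p_s), R @ (p_bar_t - p_bar_s)))


def residual_symmetry(R, q1, q2, n_bar):
    return np.cross(hom_point(q1), hom_point(q2)) @ (R @ n_bar)


# Hypothetical pose: 90-degree rotation about z, object 5 units in front of the camera.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t = np.array([0.1, -0.2, 5.0])
p_bar_s, p_bar_t = np.array([0.3, 0.1, -0.2]), np.array([-0.4, 0.5, 0.6])
n_bar = np.array([1.0, 0.0, 0.0])  # normal of the reflection plane x = 0
x1, x2 = np.array([0.7, 0.2, 0.1]), np.array([-0.7, 0.2, 0.1])  # mirror images

p_s, p_t = project(R, t, p_bar_s), project(R, t, p_bar_t)
q1, q2 = project(R, t, x1), project(R, t, x2)

r_k = residual_keypoint(R, t, p_s, p_bar_s)
r_e = residual_edge(R, t, p_t - p_s, p_s, p_bar_s, p_bar_t)
r_s = residual_symmetry(R, q1, q2, n_bar)
print(np.abs(r_k).max(), np.abs(r_e).max(), abs(r_s))  # all zero up to rounding
```

Because each residual is linear in the entries of $R$ and $t$, stacking them over all predicted elements yields a homogeneous linear system in the 12 affine pose parameters.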

HybridPose modifies the framework of EPnP [17] to generate the initial poses. By combining these three constraints from predicted elements, we generate a linear system of the form $Ax = 0$, where $A$ is a matrix of dimension $(3|\mathcal{K}| + 3|\mathcal{E}| + |\mathcal{S}|) \times 12$, and $x = [r_1^T, r_2^T, r_3^T, t^T]^T \in \mathbb{R}^{12}$ is a vector that contains rotation and translation parameters in affine space. To model the relative importance among keypoints, edge vectors, and symmetry correspondences, we rescale (2) and (3) by hyper-parameters $\alpha_E$ and $\alpha_S$, respectively, to generate $A$.

Following EPnP [17], we compute $x$ as

$$x = \sum_{i=1}^{N} \gamma_i v_i, \tag{4}$$

where $v_i$ is the right singular vector of $A$ associated with the $i$-th smallest singular value. Ideally, when predicted elements are noise-free, $N = 1$ with $x = v_1$ is an optimal solution. However, this strategy performs poorly given noisy predictions. Same as EPnP [17], we choose $N = 4$. To compute the optimal $x$, we optimize latent variables $\gamma_i$ and the rotation matrix $R$ in an alternating optimization procedure with the following objective function:

$$\min_{R \in SO(3),\, \gamma_i} \Big\| \sum_{i=1}^{4} \gamma_i R_i - R \Big\|_F^2, \tag{5}$$

where $R_i \in \mathbb{R}^{3 \times 3}$ is reshaped from the first 9 elements of $v_i$. After obtaining the optimal $x$, we project the resulting affine transformation into a rigid transformation. Due to space constraints, we defer details to the supplementary material.
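To illustrate the SVD-based initialization, here is a deliberately simplified sketch: it uses keypoint constraints only, takes a single right singular vector (the noise-free $N = 1$ case), and projects the affine solution to a rigid pose with a second SVD. It does not implement the alternating optimization over several singular vectors described above. The ordering of the unknown vector (row-major rotation entries followed by the translation) and all names are assumptions.

```python
import numpy as np


def skew(a):
    """Matrix form of the cross product: skew(a) @ b == np.cross(a, b)."""
    return np.array([[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]])


def keypoint_block(p_hat, p_bar):
    """3x12 block B with B @ x == p_hat x (R @ p_bar + t), x = [R row-major, t]."""
    lift = np.hstack([np.kron(np.eye(3), p_bar[None, :]), np.eye(3)])
    return skew(p_hat) @ lift


def initial_pose(p_hats, p_bars):
    A = np.vstack([keypoint_block(ph, pb) for ph, pb in zip(p_hats, p_bars)])
    x = np.linalg.svd(A)[2][-1]      # right singular vector of the smallest singular value
    M, t = x[:9].reshape(3, 3), x[9:]
    if np.linalg.det(M) < 0:         # x is only defined up to sign
        M, t = -M, -t
    U, S, Vt = np.linalg.svd(M)
    return U @ Vt, t / S.mean()      # nearest rotation, scale-corrected translation


# Synthetic, noise-free check with a hypothetical ground-truth pose.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_true = Q if np.linalg.det(Q) > 0 else -Q
t_true = np.array([0.2, -0.1, 6.0])
p_bars = rng.uniform(-1.0, 1.0, size=(8, 3))
cam = p_bars @ R_true.T + t_true
p_hats = cam / cam[:, 2:3]           # homogeneous normalized image points (x, y, 1)

R_est, t_est = initial_pose(p_hats, p_bars)
print(np.abs(R_est - R_true).max(), np.abs(t_est - t_true).max())
```

Edge and symmetry constraints would add further row blocks to the same matrix, scaled by their respective weights, before the decomposition.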

Refinement sub-module. Although (5) combines hybrid intermediate representations and admits good initialization, it does not directly model outliers in predicted elements. Another limitation comes from (1) and (2), which do not minimize the projection errors (i.e., with respect to keypoints and edges), which are known to be effective in landmark-based pose estimation (c.f. [34]).

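Given an initial object pose $(R^{init}, t^{init})$, the refinement sub-module performs local optimization to refine the object pose. We introduce two difference vectors that involve projection errors:

$$r^{\mathcal{K}}_{R,t}(p_k) := \mathcal{P}_{R,t}(\bar{p}_k) - p_k, \tag{6}$$
$$r^{\mathcal{E}}_{R,t}(v_e) := \mathcal{P}_{R,t}(\bar{p}_{e_t}) - \mathcal{P}_{R,t}(\bar{p}_{e_s}) - v_e, \tag{7}$$

where $\mathcal{P}_{R,t}: \mathbb{R}^3 \to \mathbb{R}^2$ is the projection operator induced from the current pose $(R, t)$.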

To prune outliers in the predicted elements, we consider a generalized Geman-McClure (GM) robust function

$$\rho(x, \beta) := \frac{\beta_1^2}{\beta_2^2 + x^2}. \tag{8}$$


With this setup, HybridPose solves the following non-linear optimization problem for pose refinement:

$$\min_{R,t}\ \sum_{k=1}^{|\mathcal{K}|} \rho\big(\|r^{\mathcal{K}}_{R,t}(p_k)\|, \beta_K\big)\, \|r^{\mathcal{K}}_{R,t}(p_k)\|^2_{\Sigma_k} + \frac{|\mathcal{K}|}{|\mathcal{E}|} \sum_{e=1}^{|\mathcal{E}|} \rho\big(\|r^{\mathcal{E}}_{R,t}(v_e)\|, \beta_E\big)\, \|r^{\mathcal{E}}_{R,t}(v_e)\|^2_{\Sigma_e} + \frac{|\mathcal{K}|}{|\mathcal{S}|} \sum_{s=1}^{|\mathcal{S}|} \rho\big(r^{\mathcal{S}}_{R,t}(q_{s,1}, q_{s,2}), \beta_S\big), \tag{9}$$


where $\beta_K$, $\beta_E$, and $\beta_S$ are separate hyper-parameters for keypoints, edges, and symmetry correspondences. $\Sigma_k$ and $\Sigma_e$ denote the covariance information attached to the keypoint and edge predictions, with $\|x\|_A = (x^T A x)^{1/2}$. When covariances of predictions are unavailable, we simply set $\Sigma_k = \Sigma_e = I_2$.

Starting from $R^{init}$ and $t^{init}$, the refinement sub-module employs the Gauss-Newton method for numerical optimization.
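To give some intuition for how a GM-style robust weight suppresses outliers, the sketch below applies it to a toy 2D location estimate rather than to the full pose objective. It assumes the single-parameter weight $\beta^2 / (\beta^2 + r^2)$ and a simple reweighting scheme in which weights are frozen at the previous estimate; this is an illustrative stand-in, not the Gauss-Newton refinement used by HybridPose, and the names and data are hypothetical.

```python
import numpy as np


def gm_weight(residual_norm, beta):
    """GM-style weight: close to 1 for small residuals, close to 0 for large ones."""
    return beta ** 2 / (beta ** 2 + residual_norm ** 2)


def robust_location(points, beta, iters=20):
    """Reweighted mean of 2D points, with weights frozen at the previous estimate."""
    estimate = np.median(points, axis=0)
    for _ in range(iters):
        residuals = np.linalg.norm(points - estimate, axis=1)
        weights = gm_weight(residuals, beta)
        estimate = (weights[:, None] * points).sum(axis=0) / weights.sum()
    return estimate


# Five consistent predictions around (1, 2) and one gross outlier.
points = np.array([[1.1, 2.0], [0.9, 2.0], [1.0, 2.1], [1.0, 1.9], [1.0, 2.0],
                   [100.0, 100.0]])
robust = robust_location(points, beta=1.0)
plain = points.mean(axis=0)
print(robust, plain)
```

The plain mean is dragged far from the inlier cluster by the single outlier, while the reweighted estimate stays within a small fraction of a unit of (1, 2) because the outlier receives a weight on the order of 1e-5.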

In the supplementary material, we provide a stability analysis of (9), and show how the optimal solution of (9) changes with respect to noise in predicted representations. We also show collaborative strength among all three intermediate representations. While keypoints significantly contribute to the accuracy of $t$, edge vectors and symmetry correspondences can stabilize the regression of $R$.

3.4 HybridPose Training

This section describes how to train the prediction networks and hyper-parameters of HybridPose using a labeled dataset $\mathcal{T} := \{I, (\mathcal{K}^{gt}_I, \mathcal{E}^{gt}_I, \mathcal{S}^{gt}_I, (R^{gt}_I, t^{gt}_I))\}$. With $I$, $\mathcal{K}^{gt}_I$, $\mathcal{E}^{gt}_I$, $\mathcal{S}^{gt}_I$, and $(R^{gt}_I, t^{gt}_I)$, we denote the RGB image, labeled keypoints, edges, symmetry correspondences, and ground-truth object pose, respectively. A popular strategy is to train the entire model end-to-end, e.g., using recurrent networks to model the optimization procedure and introducing loss terms on the object pose output as well as the intermediate representations. However, we found this strategy sub-optimal. The distribution of predicted elements on the training set differs from that on the testing set. Even by carefully tuning the trade-off between supervisions on predicted elements and the final object pose, the pose regression model, which fits the training data, generalizes poorly on the testing data.

Our approach randomly divides the labeled set $\mathcal{T}$ into a training set $\mathcal{T}_{train}$ and a validation set $\mathcal{T}_{val}$. $\mathcal{T}_{train}$ is used to train the prediction networks, and $\mathcal{T}_{val}$ trains the hyper-parameters of the pose regression model. Implementation and training details of the prediction networks are presented in Section 4.1. In the following, we focus on training the hyper-parameters using $\mathcal{T}_{val}$.

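Initialization sub-module. Let $R^{init}_I$ and $t^{init}_I$ be the output of the initialization sub-module. We obtain the optimal hyper-parameters $\alpha_E$ and $\alpha_S$ by solving the following optimization problem:

$$\min_{\alpha_E, \alpha_S} \sum_{I \in \mathcal{T}_{val}} \Big( \|R^{init}_I - R^{gt}_I\|_F^2 + \|t^{init}_I - t^{gt}_I\|^2 \Big). \tag{10}$$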
Since the number of hyper-parameters is rather small, and the pose initialization step does not admit an explicit expression, we use the finite-difference method to compute numerical gradient, i.e., by fitting the gradient to samples of the hyper-parameters around the current solution. We then apply back-track line search for optimization.
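The hyper-parameter search described above can be sketched as gradient descent with a numerically estimated gradient and a backtracking (Armijo) line search. For brevity the sketch uses central differences instead of fitting the gradient to a set of samples, and a toy quadratic stands in for the validation loss; the function names, step sizes, and the target values 0.3 and 1.5 are all hypothetical.

```python
import numpy as np


def numerical_gradient(loss, x, eps=1e-4):
    """Central-difference estimate of the gradient of loss at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (loss(x + step) - loss(x - step)) / (2.0 * eps)
    return grad


def backtracking_descent(loss, x0, iters=50, step0=1.0, shrink=0.5, c=1e-4):
    """Gradient descent; each step is halved until the Armijo condition holds."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = numerical_gradient(loss, x)
        step = step0
        while loss(x - step * g) > loss(x) - c * step * (g @ g) and step > 1e-12:
            step *= shrink
        x = x - step * g
    return x


def toy_validation_loss(alpha):
    """Stand-in for the pose error on the validation set as a function of two weights."""
    return (alpha[0] - 0.3) ** 2 + 2.0 * (alpha[1] - 1.5) ** 2


alpha_opt = backtracking_descent(toy_validation_loss, [1.0, 1.0])
print(alpha_opt)
```

In the actual setting each loss evaluation would rerun pose initialization on every validation image, which is why keeping the number of hyper-parameters small matters.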

Figure 3: Pose regression results. HybridPose is able to accurately predict 6D poses from RGB images. HybridPose handles situations where the object has no occlusion (a, d, f, h), light occlusion (b, c), and severe occlusion (e, g). For illustrative purposes, we only draw 50 randomly selected symmetry correspondences in each example.

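Refinement sub-module. Let $\beta = \{\beta_K, \beta_E, \beta_S\}$ be the hyper-parameters of this sub-module. For each instance $(I, (\mathcal{K}^{gt}_I, \mathcal{E}^{gt}_I, \mathcal{S}^{gt}_I, (R^{gt}_I, t^{gt}_I))) \in \mathcal{T}_{val}$, denote the objective function in (9) as $f_I(c, \beta)$, where $c \in \mathbb{R}^6$ is a local parameterization of $R_I$ and $t_I$ around the ground-truth pose, so that $c = 0$ corresponds to $(R^{gt}_I, t^{gt}_I)$.

The refinement module solves an unconstrained optimization problem, whose optimal solution is dictated by its critical points and the loss surface around the critical points. We consider two simple objectives. The first objective forces $\frac{\partial f_I}{\partial c}(0, \beta) \approx 0$, or in other words, the ground-truth is approximately a critical point. The second objective minimizes the condition number $\kappa\big(\frac{\partial^2 f_I}{\partial c^2}(0, \beta)\big)$, i.e., the ratio between the largest and smallest eigenvalues of the Hessian. This objective regularizes the loss surface around each optimal solution, promoting a large convergence radius for $f_I(c, \beta)$. With this setup, we formulate the following objective function to optimize $\beta$:

$$\min_{\beta} \sum_{I \in \mathcal{T}_{val}} \Big\| \frac{\partial f_I}{\partial c}(0, \beta) \Big\|^2 + \gamma\, \kappa\Big( \frac{\partial^2 f_I}{\partial c^2}(0, \beta) \Big), \tag{11}$$

where $\gamma$ is a constant weight on the condition-number term. The same strategy used in (10) is then applied to optimize (11).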
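A per-instance version of the two objectives can be sketched as follows, assuming they are combined as a weighted sum of the squared gradient norm and the Hessian condition number, and that the Hessian is symmetric positive definite. The function name and the value of `gamma` used in the example are hypothetical.

```python
import numpy as np


def refinement_hyperparameter_loss(gradient, hessian, gamma):
    """Squared gradient norm plus gamma times the Hessian condition number."""
    eigenvalues = np.linalg.eigvalsh(hessian)  # ascending order, symmetric input assumed
    condition_number = eigenvalues[-1] / eigenvalues[0]
    return float(gradient @ gradient + gamma * condition_number)


gamma = 1e-2  # hypothetical trade-off weight
well_conditioned = refinement_hyperparameter_loss(np.zeros(6), np.eye(6), gamma)
ill_conditioned = refinement_hyperparameter_loss(
    np.zeros(6), np.diag([1.0, 1.0, 1.0, 1.0, 1.0, 100.0]), gamma)
off_critical = refinement_hyperparameter_loss(np.full(6, 0.5), np.eye(6), gamma)
print(well_conditioned, ill_conditioned, off_critical)
```

Both an elongated loss surface and a ground-truth pose that is not a critical point increase the loss, which is the behavior the two objectives are meant to penalize.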

4 Experimental Evaluation

This section presents an experimental evaluation of the proposed approach. Section 4.1 describes the experimental setup. Section 4.2 quantitatively and qualitatively compares HybridPose with other 6D pose estimation methods. Section 4.3 presents an ablation study to investigate the effectiveness of symmetry correspondences, edge vectors, and the refinement sub-module.

4.1 Experimental Setup

Datasets. We consider two popular benchmark datasets that are widely used in the 6D pose estimation problem, Linemod [11] and Occlusion Linemod [3]. In comparison to Linemod, Occlusion Linemod contains more examples where the objects are under occlusion. Our keypoint annotation strategy follows that of [33], i.e., we choose $|\mathcal{K}| = 8$ keypoints via the farthest point sampling algorithm. Edge vectors are defined as vectors connecting each pair of keypoints. In total, each object has $|\mathcal{E}| = |\mathcal{K}|(|\mathcal{K}| - 1)/2 = 28$ edges. We further use the algorithm proposed in [8] to annotate Linemod and Occlusion Linemod with reflection symmetry labels.
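Farthest point sampling can be sketched in a few lines: starting from one point, it repeatedly adds the point that is farthest from everything chosen so far. The starting index and the toy point set below are assumptions for illustration; how the first keypoint is chosen in [33] is not specified here.

```python
import numpy as np


def farthest_point_sampling(points, k, start=0):
    """Greedily pick k indices so that each new point is farthest from those already chosen."""
    points = np.asarray(points, dtype=float)
    chosen = [start]
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        # Distance of every point to its nearest chosen point.
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return chosen


# Eleven points on a line segment: x = 0, 1, ..., 10.
line = np.stack([np.arange(11.0), np.zeros(11), np.zeros(11)], axis=1)
picked = farthest_point_sampling(line, k=3)
print(picked)  # [0, 10, 5]
```

On a real object model the same routine spreads keypoints over the surface, which tends to place them on extremities such as corners and handles.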

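Implementation details. We use ResNet [10] with pre-trained weights on ImageNet [7] to build the prediction networks $f^{\mathcal{K}}_{\theta}$, $f^{\mathcal{E}}_{\phi}$, and $f^{\mathcal{S}}_{\gamma}$. The prediction networks take an RGB image $I$ of size $(3, H, W)$ as input, and output a tensor of size $(C, H, W)$, where $(H, W)$ is the image resolution, and $C = 1 + 2|\mathcal{K}| + 2|\mathcal{E}| + 2$ is the number of channels in the output tensor.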

The first channel in the output tensor is a binary segmentation mask $M$. If $M(x, y) = 1$, then $(x, y)$ corresponds to a pixel on the object of interest in the input image $I$. The segmentation mask is trained using the cross-entropy loss.

The next $2|\mathcal{K}|$ channels in the output tensor give the $x$ and $y$ components of all $|\mathcal{K}|$ keypoints. A voting-based keypoint localization scheme [33] is applied to extract the coordinates of 2D keypoints from this $2|\mathcal{K}|$-channel tensor and the segmentation mask $M$.

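The next $2|\mathcal{E}|$ channels in the output tensor give the $x$ and $y$ components of all $|\mathcal{E}|$ edges, which we denote as $Edge$. Let $i$ ($0 \le i < |\mathcal{E}|$) be the index of an edge. Then

$$Edge_i = \{(Edge(2i, x, y),\ Edge(2i + 1, x, y)) \mid M(x, y) = 1\}$$

is a set of 2-tuples containing pixel-wise predictions of the $i$-th edge vector, whose mean is extracted as the predicted edge.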
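The channel layout and the mean-based edge extraction can be sketched as below. The sketch assumes the channel order mask, keypoints, edges, symmetry; a tensor indexed as (channel, row, column); a mask that has already been binarized; and hypothetical sizes of 8 keypoints and 28 edges. All function names are illustrative.

```python
import numpy as np


def split_output(output, num_keypoints, num_edges):
    """Split a (C, H, W) tensor into mask, keypoint, edge, and symmetry channels."""
    k_end = 1 + 2 * num_keypoints
    e_end = k_end + 2 * num_edges
    assert output.shape[0] == e_end + 2, "unexpected number of channels"
    return output[0], output[1:k_end], output[k_end:e_end], output[e_end:]


def mean_edge_vectors(edge_channels, mask):
    """Average the per-pixel (x, y) edge predictions over object pixels only."""
    on_object = mask > 0.5
    per_edge = edge_channels.reshape(-1, 2, *mask.shape)  # (num_edges, 2, H, W)
    return per_edge[:, :, on_object].mean(axis=2)         # (num_edges, 2)


K, E, H, W = 8, 28, 4, 5
output = np.full((1 + 2 * K + 2 * E + 2, H, W), 99.0)  # 99 marks background values
output[0] = 0.0
output[0, 1:3, 1:4] = 1.0                               # object occupies a 2x3 region
mask, kpt_ch, edge_ch, sym_ch = split_output(output, K, E)
for i in range(E):
    edge_ch[2 * i, 1:3, 1:4] = i        # x component of edge i on the object
    edge_ch[2 * i + 1, 1:3, 1:4] = -i   # y component of edge i on the object
edges_pred = mean_edge_vectors(edge_ch, mask)
print(output.shape[0], edges_pred.shape, edges_pred[5])
```

Averaging only inside the mask keeps background predictions, here the value 99, from contaminating the extracted edge vectors.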

The final 2 channels in the output tensor define the $x$ and $y$ components of symmetry correspondences. We denote this 2-channel “map” of symmetry correspondences as $Sym$. Let $(x, y)$ be a pixel on the object of interest in the input image, i.e., $M(x, y) = 1$. Assuming $\Delta x = Sym(0, x, y)$ and $\Delta y = Sym(1, x, y)$, we consider $(x, y)$ and $(x + \Delta x, y + \Delta y)$ to be symmetric with respect to the reflection symmetry plane.
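Turning the 2-channel symmetry map into explicit pixel pairs can be sketched as follows. The sketch assumes arrays indexed as (channel, row, column), a binarized mask, and output coordinates in (x, y) order; the function name and toy values are hypothetical.

```python
import numpy as np


def symmetry_correspondences(sym, mask):
    """Return arrays q1, q2 of shape (N, 2), in (x, y) order, for all object pixels."""
    ys, xs = np.nonzero(mask > 0.5)
    q1 = np.stack([xs, ys], axis=1).astype(float)
    offsets = np.stack([sym[0, ys, xs], sym[1, ys, xs]], axis=1)
    return q1, q1 + offsets


mask = np.zeros((3, 4))
mask[1, 2] = 1.0                         # a single object pixel at x = 2, y = 1
sym = np.zeros((2, 3, 4))
sym[0], sym[1] = 1.5, -0.5               # constant predicted offsets (dx, dy)
q1, q2 = symmetry_correspondences(sym, mask)
print(q1, q2)
```

Every object pixel contributes one pair, so the number of symmetry constraints scales with the visible area of the object.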

We train all three intermediate representations using the smooth $\ell_1$ loss described in [9]. Network training employs the Adam [16] optimizer with a learning rate of 0.02 for 200 epochs. Training weights of the segmentation mask, keypoints, edge vectors, and symmetry correspondences are 1.0, 10.0, 0.1, and 0.1, respectively.

Evaluation protocols. We use two metrics to evaluate the performance of HybridPose:

1. ADD(-S) [11, 41] first calculates the distance between two point sets transformed by predicted pose and ground-truth pose respectively, and then extracts the mean distance. When the object possesses symmetric pose ambiguity, the mean distance is computed from the closest points between two transformed sets. ADD(-S) accuracy is defined as the percentage of examples whose calculated mean distance is less than 10% of the model diameter.

2. In the ablation study, we compute and report the angular rotation error and the relative translation error $\|t_I - t^{gt}_I\| / d$ between the predicted pose $(R_I, t_I)$ and the ground-truth pose $(R^{gt}_I, t^{gt}_I)$, where $d$ is the object diameter.
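The two evaluation protocols can be sketched as below. ADD averages distances between corresponding model points, ADD-S averages closest-point distances for symmetric objects, and the rotation error is taken here as the geodesic angle between the two rotations, which is an assumption about the exact formula. The cube model and the poses are toy values, and the function names are illustrative.

```python
import numpy as np


def transform(points, R, t):
    return points @ R.T + t


def add_distance(points, R_pred, t_pred, R_gt, t_gt):
    """Mean distance between corresponding points under the two poses."""
    diff = transform(points, R_pred, t_pred) - transform(points, R_gt, t_gt)
    return np.linalg.norm(diff, axis=1).mean()


def add_s_distance(points, R_pred, t_pred, R_gt, t_gt):
    """Mean distance from each ground-truth point to its closest predicted point."""
    gt = transform(points, R_gt, t_gt)
    pred = transform(points, R_pred, t_pred)
    pairwise = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return pairwise.min(axis=1).mean()


def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two rotations, in degrees."""
    cos_angle = (np.trace(R_gt.T @ R_pred) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))


def relative_translation_error(t_pred, t_gt, diameter):
    return np.linalg.norm(t_pred - t_gt) / diameter


# Toy model: corners of a cube, which looks identical after a 90-degree turn about z.
cube = np.array([[x, y, z] for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (-1.0, 1.0)])
diameter = 2.0 * np.sqrt(3.0)
R_gt, t_gt = np.eye(3), np.array([0.0, 0.0, 10.0])
R_pred = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
t_pred = t_gt.copy()

add = add_distance(cube, R_pred, t_pred, R_gt, t_gt)
add_s = add_s_distance(cube, R_pred, t_pred, R_gt, t_gt)
print(add, add_s, rotation_error_deg(R_pred, R_gt))
print(add < 0.1 * diameter, add_s < 0.1 * diameter)
```

In this toy case the prediction is rejected under ADD but accepted under ADD-S, which is why the symmetric variant is used for objects with pose ambiguity.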

4.2 Analysis of Results

object        Tekin   BB8    Pix2Pose   PVNet   CDPN   DPOD    Ours
ape           21.6    40.4   58.1       43.6    64.4   87.7    77.6
benchvise     81.8    91.8   91.0       99.9    97.8   98.5    99.6
cam           36.6    55.7   60.9       86.9    91.7   96.1    95.9
can           68.8    64.1   84.4       95.5    95.9   99.7    93.6
cat           41.8    62.6   65.0       79.3    83.8   94.7    93.5
driller       63.5    74.4   76.3       96.4    96.2   98.8    97.2
duck          27.2    44.3   43.8       52.6    66.8   86.3    87.0
eggbox†       69.6    57.8   96.8       99.2    99.7   99.9    99.6
glue†         80.0    41.2   79.4       95.7    99.6   96.8    98.7
holepuncher   42.6    67.2   74.8       81.9    85.8   86.9    92.5
iron          75.0    84.7   83.4       98.9    97.9   100.0   98.1
lamp          71.1    76.5   82.0       99.3    97.9   96.8    96.9
phone         47.7    54.0   45.0       92.4    90.8   94.7    98.3
average       56.0    62.7   72.4       86.3    89.9   95.2    94.5
Table 1: Quantitative evaluation: ADD(-S) accuracy (%) on Linemod. Baseline approaches: Tekin et al. [37], BB8 [35], Pix2Pose [29], PVNet [33], CDPN [19], and DPOD [43]. Objects annotated with (†) possess symmetric pose ambiguity.
object        PoseCNN   Oberweger   Hu     Pix2Pose   PVNet   DPOD   Ours
ape           9.6       12.1        17.6   22.0       15.8    -      53.3
can           45.2      39.9        53.9   44.7       63.3    -      86.5
cat           0.93      8.2         3.3    22.7       16.7    -      73.4
driller       41.4      45.2        62.4   44.7       65.7    -      92.8
duck          19.6      17.2        19.2   15.0       25.2    -      62.8
eggbox†       22.0      22.1        25.9   25.2       50.2    -      95.3
glue†         38.5      35.8        39.6   32.4       49.6    -      92.5
holepuncher   22.1      36.0        21.3   49.5       39.7    -      76.7
average       24.9      27.0        27.0   32.0       40.8    47.3   79.2
Table 2: Quantitative evaluation: ADD(-S) accuracy (%) on Occlusion Linemod. Baseline approaches: PoseCNN [41], Oberweger et al. [28], Hu et al. [13], Pix2Pose [29], PVNet [33], and DPOD [43]. Objects annotated with (†) possess symmetric pose ambiguity.

As shown in Table 1, Table 2, and Figure 3, HybridPose leads to accurate pose estimation. On Linemod and Occlusion Linemod, HybridPose has an average ADD(-S) accuracy of 94.5% and 79.2%, respectively. The result on Linemod outperforms all but one of the state-of-the-art approaches that regress poses from intermediate representations. The result on Occlusion Linemod outperforms all state-of-the-art approaches.

Baseline comparison on Linemod. HybridPose outperforms PVNet [33], the backbone model we use to predict keypoints, raising the average accuracy from 86.3% to 94.5%. The improvement holds on 9 of the 13 object classes in Table 1 (all except benchvise, can, iron, and lamp), which demonstrates a clear advantage of using a hybrid as opposed to a unitary intermediate representation. HybridPose shows competitive results against DPOD [43], winning on six object classes (benchvise, duck, glue, holepuncher, lamp, and phone). The advantage of DPOD on this particular dataset comes from data augmentation and explicit modeling of dense correspondences between input and projected images, both of which cater to situations without object occlusion. A detailed analysis reveals that the classes of objects on which HybridPose exhibits sub-optimal performance are the smallest objects in Linemod. This suggests that pixel-based descriptors used in our pipeline are limited by image resolution.

object        keypoints                keypoints + symmetries   full model
              Rotation   Translation   Rotation   Translation   Rotation   Translation
ape           1.914°     0.107         1.809°     0.113         1.543°     0.092
can           1.472°     0.059         1.710°     0.073         0.912°     0.041
cat           1.039°     0.119         0.888°     0.117         0.751°     0.055
driller       1.180°     0.057         1.180°     0.057         0.803°     0.027
duck          1.773°     0.116         1.679°     0.115         1.439°     0.068
eggbox        1.675°     0.107         1.587°     0.105         1.052°     0.052
glue          1.796°     0.097         1.681°     0.099         1.224°     0.066
holepuncher   2.319°     0.141         2.192°     0.140         1.704°     0.051
mean          1.648°     0.100         1.590°     0.102         1.179°     0.057
Table 3: Quantitative evaluation with different intermediate representations on Occlusion Linemod. We report errors using two metrics: the median of absolute angular error in rotation, and the median of relative error in translation with respect to object diameter.

Baseline comparison on Occlusion-Linemod. HybridPose outperforms all baseline approaches by a considerable margin. In terms of ADD(-S) accuracy, our approach improves PVNet [33] from 40.8 to 79.2, representing a 94.1% enhancement. This enhancement clearly shows the advantage of HybridPose on occluded objects, where predictions of invisible keypoints can be noisy, and visible keypoints may not provide sufficient constraints for pose regression alone. HybridPose also outperforms DPOD, the current state-of-the-art pose estimator on Occlusion Linemod, by 67.4%. One explanation is that rendering-based approaches like DPOD work less well on occluded objects, due to the difficulty in modeling occlusions in data augmentation and correspondence computation.

Running time. On a desktop with a 16-core Intel(R) Xeon(R) E5-2637 CPU and a GeForce GTX 1080 GPU, HybridPose takes 0.6 seconds to predict the intermediate representations and 0.4 seconds to regress the pose for a batch of 30 images. At 1.0 second per batch, this gives an average processing speed of 30 frames per second, enabling real-time analysis.

4.3 Ablation Study

We proceed to provide an ablation study. Table 3 summarizes the performance of HybridPose using different predicted intermediate representations. Since the performance of different methods is close to saturation on Linemod, the ablation study we present here is based on Occlusion Linemod, which clearly reveals the effect of different predicted elements on pose refinement. The ablation study on Linemod is deferred to the supplementary material.

With keypoints. As a baseline approach, we first estimate object poses by only utilizing keypoint information. As shown in Table 3, this gives a mean absolute rotation error of 1.648 degrees, and a mean relative translation error of 0.100.

With keypoints and symmetry. Adding symmetry correspondences to keypoints leads to a noticeable performance gain in the rotation component. The relative performance gain is 3.52%, and the improvement is nearly consistent across all object categories. This consistency clearly demonstrates the effectiveness of symmetry correspondences. On the other hand, the translation error remains almost the same whether we use keypoints alone or keypoints + symmetry. One explanation is that symmetry correspondences constrain only two of the three rotational degrees of freedom, while providing no constraints on the translation parameters.
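The degrees-of-freedom argument can be illustrated numerically. The sketch below assumes a symmetry residual of the form (q1 × q2)ᵀ R n, where q1 and q2 are homogeneous image coordinates of a symmetric pair and n is the normal of the reflection plane in the object frame; the exact residual used by HybridPose may differ in normalization. All numeric values are hypothetical.

```python
import numpy as np


def rodrigues(axis, angle_rad):
    """Rotation matrix for a rotation of angle_rad about the given axis."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle_rad) * K + (1.0 - np.cos(angle_rad)) * (K @ K)


def symmetry_residual(q1, q2, R, n):
    """Assumed residual (q1 x q2)^T R n; note that translation never appears."""
    return float(np.cross(q1, q2) @ (R @ n))


n = np.array([0.0, 1.0, 0.0])                   # hypothetical reflection-plane normal
R = rodrigues([1.0, 2.0, 3.0], 0.7)             # some object rotation
R_spun = R @ rodrigues(n, 1.1)                  # same rotation plus a spin about n

q1 = np.array([0.2, -0.1, 1.0])                 # hypothetical symmetric image points
q2 = np.array([-0.3, 0.4, 1.0])

r_original = symmetry_residual(q1, q2, R, n)
r_spun = symmetry_residual(q1, q2, R_spun, n)   # equal up to round-off
```

Because a spin about n leaves R n unchanged, the residual cannot distinguish `R` from `R_spun`: only the direction of R n (two degrees of freedom) is constrained, and the translation does not enter the residual at all.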

Full model. Adding edge vectors to keypoints and symmetry correspondences leads to salient performance gains in both rotation and translation estimation. The relative performance gains in rotation and translation are 25.85% and 44.12%, respectively. One explanation is that edge vectors provide more constraints on both translation and rotation. Compared with keypoints, edge vectors provide more constraints on translation because they represent the displacement between adjacent keypoints and provide gradient information for regression. As a result, the translation error is significantly reduced. Compared with symmetry correspondences, which provide only two constraints on rotation, edge vectors constrain all three rotational degrees of freedom, which boosts the performance of rotation estimation. Additionally, the improved rotation estimation helps the GM robust function in the refinement sub-module identify outliers in keypoint predictions.
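To see why a GM robust function helps with outliers, consider the textbook Geman-McClure loss ρ(r) = r² / (r² + σ²). This is a sketch under that assumption; the parameterization used in the HybridPose refinement sub-module may differ, and the function name and residual values are ours.

```python
def geman_mcclure(residual: float, sigma: float = 1.0) -> float:
    """Textbook Geman-McClure loss: r^2 / (r^2 + sigma^2), bounded above by 1."""
    r2 = residual * residual
    return r2 / (r2 + sigma * sigma)


inlier_residual = 0.1      # hypothetical small reprojection residual
outlier_residual = 100.0   # hypothetical gross keypoint error

inlier_cost = geman_mcclure(inlier_residual)     # close to r^2 for small r
outlier_cost = geman_mcclure(outlier_residual)   # saturates just below 1
squared_outlier_cost = outlier_residual ** 2     # 10000.0 under plain least squares
```

Under plain least squares the outlier would contribute a cost of 10000 and dominate the fit, whereas the robust loss caps its contribution below 1, so a better initial rotation makes true outliers stand out by their large residuals without letting them steer the refinement.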

5 Conclusions and Future Work

In this paper, we introduce HybridPose, a 6D pose estimation approach that utilizes keypoints, edge vectors, and symmetry correspondences. Experiments show that HybridPose enjoys real-time prediction and outperforms current state-of-the-art pose estimation approaches in accuracy. HybridPose is robust to occlusion and extreme poses. In the future, we plan to extend HybridPose to include more intermediate representations such as shape primitives, normals, and planar faces. Another possible direction of future work is to enforce consistency across different representations as a self-supervision loss in network training.

References


  • [1] I. R. Atadjanov and S. Lee (2016) Reflection symmetry detection via appearance of structure descriptor. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III, pp. 3–18. Cited by: §2.
  • [2] G. Bertasius, J. Shi, and L. Torresani (2015) Deepedge: a multi-scale bifurcated deep network for top-down contour detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4380–4389. Cited by: §2.
  • [3] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother (2014) Learning 6d object pose estimation using 3d object coordinates. In European conference on computer vision, pp. 536–551. Cited by: HybridPose: 6D Object Pose Estimation under Hybrid Representations, §1, §2, §4.1.
  • [4] E. Brachmann, F. Michel, A. Krull, M. Ying Yang, S. Gumhold, et al. (2016) Uncertainty-driven 6d pose estimation of objects and scenes from a single rgb image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3364–3372. Cited by: §2.
  • [5] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime multi-person 2d pose estimation using part affinity fields. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 1302–1310. Cited by: §2.
  • [6] E. Corona, K. Kundu, and S. Fidler (2018) Pose estimation for objects with rotational symmetry. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7215–7222. Cited by: §2.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009-06) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §4.1.
  • [8] A. Ecins, C. Fermüller, and Y. Aloimonos (2018) Seeing behind the scene: using symmetry to reason about objects in cluttered environments. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7193–7200. Cited by: §4.1.
  • [9] R. Girshick (2015-12) Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV). External Links: ISBN 9781467383912. Cited by: §4.1.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385. Cited by: §3.2, §4.1.
  • [11] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab (2013) Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Proceedings of the 11th Asian Conference on Computer Vision - Volume Part I, ACCV’12, Berlin, Heidelberg, pp. 548–562. External Links: ISBN 978-3-642-37330-5. Cited by: §1, §4.1, §4.1.
  • [12] W. Hong, A. Y. Yang, K. Huang, and Y. Ma (2004) On symmetry and multiple-view geometry: structure, pose, and calibration from a single image. International Journal of Computer Vision 60 (3), pp. 241–265. Cited by: §2.
  • [13] Y. Hu, J. Hugonot, P. Fua, and M. Salzmann (2019) Segmentation-driven 6d object pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3385–3394. Cited by: §2, Table 2.
  • [14] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) FlowNet 2.0: evolution of optical flow estimation with deep networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp. 1647–1655. Cited by: §3.2.
  • [15] A. Kendall, M. Grimes, and R. Cipolla (2015) Posenet: a convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision, pp. 2938–2946. Cited by: §1.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. External Links: 1412.6980 Cited by: §4.1.
  • [17] V. Lepetit, F. Moreno-Noguer, and P. Fua (2009) EPnP: an accurate O(n) solution to the PnP problem. International Journal of Computer Vision 81 (2), pp. 155. Cited by: §2, §3.3, §3.3.
  • [18] Y. Li, G. Wang, X. Ji, Y. Xiang, and D. Fox (2018) Deepim: deep iterative matching for 6d pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 683–698. Cited by: §2.
  • [19] Z. Li, G. Wang, and X. Ji (2019) CDPN: coordinates-based disentangled pose network for real-time rgb-based 6-dof object pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7678–7687. Cited by: §2, Table 1.
  • [20] G. Liu, C. Yang, Z. Li, D. Ceylan, and Q. Huang (2016) Symmetry-aware depth estimation using deep neural networks. arXiv preprint arXiv:1604.06079. Cited by: §2.
  • [21] Y. Liu, H. Hel-Or, C. S. Kaplan, and L. V. Gool (2010) Computational symmetry in computer vision and computer graphics. Foundations and Trends in Computer Graphics and Vision 5 (1-2), pp. 1–195. Cited by: §2.
  • [22] Y. Liu and M. S. Lew (2016) Learning relaxed deep supervision for better edge detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 231–240. Cited by: §2.
  • [23] Y. Ma, S. Soatto, J. Kosecka, and S. S. Sastry (2003) An invitation to 3-d vision: from images to geometric models. Springer-Verlag. External Links: ISBN 0387008934 Cited by: §3.2.
  • [24] F. Manhardt, D. M. Arroyo, C. Rupprecht, B. Busam, T. Birdal, N. Navab, and F. Tombari (2019) Explaining the ambiguity of object detection and 6d pose from visual data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6841–6850. Cited by: §2.
  • [25] F. Manhardt, W. Kehl, N. Navab, and F. Tombari (2018) Deep model-based 6d pose refinement in rgb. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 800–815. Cited by: §2.
  • [26] N. J. Mitra, M. Pauly, M. Wand, and D. Ceylan (2013-09) Symmetry in 3d geometry: extraction and applications. Comput. Graph. Forum 32 (6), pp. 1–23. External Links: ISSN 0167-7055. Cited by: §2.
  • [27] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, pp. 483–499. Cited by: §2.
  • [28] M. Oberweger, M. Rad, and V. Lepetit (2018) Making deep heatmaps robust to partial occlusions for 3d object pose estimation. Lecture Notes in Computer Science, pp. 125–141. External Links: ISBN 9783030012670, ISSN 1611-3349. Cited by: Table 2.
  • [29] K. Park, T. Patten, and M. Vincze (2019) Pix2Pose: pixel-wise coordinate regression of objects for 6d pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7668–7677. Cited by: §2, §2, Table 1.
  • [30] G. Passalis, P. Perakis, T. Theoharis, and I. A. Kakadiaris (2011) Using facial symmetry to handle pose variations in real-world 3d face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (10), pp. 1938–1951. Cited by: §2.
  • [31] G. Pavlakos, X. Zhou, A. Chan, K. G. Derpanis, and K. Daniilidis (2017) 6-dof object pose from semantic keypoints. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2011–2018. Cited by: §2, §2.
  • [32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §5.
  • [33] S. Peng, Y. Liu, Q. Huang, H. Bao, and X. Zhou (2018) PVNet: pixel-wise voting network for 6dof pose estimation. CoRR abs/1812.11788. Cited by: §1, §2, §2, §3.1, §3.2, §4.1, §4.1, §4.2, §4.2, Table 1, Table 2.
  • [34] M. Persson and K. Nordberg (2018-09) Lambda twist: an accurate fast robust perspective three point (p3p) solver. In The European Conference on Computer Vision (ECCV), Cited by: §3.3, §3.3.
  • [35] M. Rad and V. Lepetit (2017) BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3828–3836. Cited by: §2, §2, Table 1.
  • [36] M. Sundermeyer, Z. Marton, M. Durner, M. Brucker, and R. Triebel (2018) Implicit 3d orientation learning for 6d object detection from rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 699–715. Cited by: §2.
  • [37] B. Tekin, S. N. Sinha, and P. Fua (2018) Real-time seamless single shot 6d object pose prediction. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 292–301. Cited by: §1, Table 1.
  • [38] S. Tulsiani and J. Malik (2015) Viewpoints and keypoints. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pp. 1510–1519. Cited by: §1.
  • [39] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese (2019) Densefusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3343–3352. Cited by: §2, §2.
  • [40] Z. Wang, Z. Tang, and X. Zhang (2015) Reflection symmetry detection using locally affine invariant edge correspondence. IEEE Trans. Image Processing 24 (4), pp. 1297–1301. Cited by: §2.
  • [41] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox (2018) PoseCNN: A convolutional neural network for 6d object pose estimation in cluttered scenes. In Robotics: Science and Systems XIV, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA, June 26-30, 2018. Cited by: §1, §2, §2, §4.1, Table 2.
  • [42] T. Xue, J. Liu, and X. Tang (2011) Symmetric piecewise planar object reconstruction from a single image. In CVPR 2011, pp. 2577–2584. Cited by: §2.
  • [43] S. Zakharov, I. Shugurov, and S. Ilic (2019) DPOD: 6d pose object detector and refiner. External Links: 1902.11020 Cited by: §1, §2, §4.2, Table 1, Table 2.
  • [44] Z. Zhang, Z. Li, N. Bi, J. Zheng, J. Wang, K. Huang, W. Luo, Y. Xu, and S. Gao (2019) PPGNet: learning point-pair graph for line segment detection. arXiv preprint arXiv:1905.03415. Cited by: §2.
  • [45] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858. Cited by: §2, §2.
  • [46] X. Zhou, A. Karpur, L. Luo, and Q. Huang (2018) StarMap for category-agnostic keypoint and viewpoint estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 318–334. Cited by: §2.