1 Introduction
Estimating the 6D pose of an object from an RGB image is a fundamental problem in 3D vision and has diverse applications in object recognition and robot-object interaction. Advances in deep learning have led to significant breakthroughs on this problem. While early works typically formulate pose estimation as end-to-end pose classification [38] or pose regression [15, 41], recent pose estimation methods usually leverage keypoints as an intermediate representation [37, 33], and align predicted 2D keypoints with ground-truth 3D keypoints for pose estimation. In addition to ground-truth pose labels, these methods incorporate keypoints as intermediate supervision, facilitating smooth model training. Keypoint-based methods are built upon two assumptions: (1) a machine learning model can accurately predict 2D keypoint locations; and (2) these predictions provide sufficient constraints to regress the underlying 6D pose. Both assumptions easily break in many real-world settings. Due to object occlusions and representational limitations of the prediction network, it is often impossible to accurately predict 2D keypoint coordinates from an RGB image alone.
In this paper, we introduce HybridPose, which leverages multiple intermediate representations to express the geometric information in the input image for pose estimation. In addition to keypoints, HybridPose integrates a prediction network that outputs edge vectors between adjacent keypoints. As most objects possess a (partial) reflection symmetry, HybridPose also utilizes predicted dense pixel-wise correspondences that reflect the underlying symmetric relations between pixels. Compared to a unitary representation, this hybrid representation enjoys a multitude of advantages. First, HybridPose integrates more signals in the input image: edge vectors capture the object's skeletal structure, and symmetry correspondences incorporate interior details. Second, HybridPose offers far more constraints than keypoints alone for pose regression, enabling accurate pose prediction even if a significant fraction of predicted elements are outliers (e.g., under occlusions). Finally, it can be shown that symmetry correspondences stabilize the rotation component of pose prediction, especially along the normal direction of the reflection plane.
Given the intermediate representations predicted by the first module, the second module of HybridPose performs pose regression. In particular, HybridPose employs trainable robust norms to prune outliers in the predicted intermediate representations. We show how to combine pose initialization and pose refinement to maximize the quality of the resulting object pose. We also show how to train HybridPose effectively using a training set for the prediction module, and a validation set for the refinement module.
We evaluate HybridPose on two popular benchmark datasets, Linemod [11] and Occlusion Linemod [3]. In terms of accuracy (under the ADD(S) metric), HybridPose leads to considerable improvements over all state-of-the-art methods that merely utilize keypoints. On Occlusion Linemod [3], HybridPose achieves an accuracy of 79.2%, a 67.4% relative improvement over DPOD [43], the current state-of-the-art method on this benchmark dataset.
Despite the gain in accuracy, our approach is efficient and runs at 30 frames per second on a commodity workstation. Compared to approaches that utilize sophisticated network architectures to predict one single intermediate representation, HybridPose achieves considerably better performance by using a relatively simple network to predict hybrid representations.
2 Related Works
Intermediate representation for pose. To express the geometric information in an RGB image, a prevalent intermediate representation is keypoints, which achieves state-of-the-art performance [33, 31, 35]. The corresponding pose estimation pipeline combines keypoint prediction and pose regression initialized by the PnP algorithm [17]. Keypoint predictions are usually generated by a neural network, and previous works use different types of tensor descriptors to express 2D keypoint coordinates. A common approach represents keypoints as peaks of heatmaps [27, 46], which becomes suboptimal when keypoints are occluded, as the input image does not provide explicit visual cues for their locations. Alternative keypoint representations include vector fields [33] and patches [13]. These representations allow better keypoint prediction under occlusion and eventually lead to improvements in pose estimation accuracy. However, keypoints alone are a sparse representation of the object pose, whose potential for improving estimation accuracy is limited.
Besides keypoints, another common intermediate representation is the coordinate of every image pixel in the 3D physical world, which provides dense 2D-3D correspondences for pose alignment and is robust under occlusion [3, 4, 29, 19]. However, regressing dense object coordinates is much more costly than keypoint prediction. They are also less accurate than keypoints due to the lack of corresponding visual cues. In addition to keypoints and pixel-wise 2D-3D correspondences, depth is another alternative intermediate representation in visual odometry settings, which can be estimated together with pose in an unsupervised manner [45]. In practice, the accuracy of depth estimation is limited by the representational power of neural networks.
Unlike previous approaches, HybridPose combines multiple intermediate representations, and exhibits collaborative strength for pose estimation.
Multimodal input. To address the challenges of pose estimation from a single RGB image, several works have considered inputs from multiple sensors. A popular approach is to leverage information from both RGB and depth images [45, 39, 41]. In the presence of depth information, pose regression can be reformulated as a 3D point alignment problem, which is then solved by the ICP algorithm [41]. Although HybridPose utilizes multiple intermediate representations, all of them are predicted from an RGB image alone. HybridPose thus handles situations in which depth information is absent.
Edge features. Edges are known to capture important image features such as object contours [2], salient edges [22], and straight line segments [44]. Unlike these low-level image features, HybridPose leverages semantic edge vectors defined between adjacent keypoints. This representation, which captures correlations between keypoints and reveals the underlying structure of the object, is concise and easy to predict. Such edge vectors offer more constraints than keypoints alone for pose regression and have clear advantages under occlusion. Our approach is similar to [5], which predicts directions between adjacent keypoints to link keypoints into a human skeleton. However, we predict both the direction and the magnitude of edge vectors, and use these vectors to estimate object poses.
Symmetry detection from images. Symmetry detection has received significant attention in computer vision. We refer readers to [21, 26] for general surveys, and to [1, 40] for recent advances. Traditional applications of symmetry detection include face recognition [30], depth estimation [20], and 3D reconstruction [12, 42]. In the context of object pose estimation, symmetry has mainly been studied from the perspective that it introduces ambiguities for pose estimation (c.f. [24, 35, 41]), since symmetric objects with different poses can have the same appearance in the image. Several works [35, 41, 6, 24, 29] have explored how to address such ambiguities, e.g., by designing loss functions that are invariant under symmetric transformations.
Robust regression. Pose estimation via intermediate representations is sensitive to outliers in the predictions, which are introduced by occlusion and cluttered backgrounds [36, 31, 39]. To mitigate pose error, several works assign different weights to different predicted elements in the 2D-3D alignment stage [33, 31]. In contrast, our approach additionally leverages robust norms to automatically filter outliers in the predicted elements.
Besides the reweighting strategy, some recent works propose deep learning-based refiners to boost pose estimation performance [18, 25, 43]. [43, 18] use a point matching loss and achieve high accuracy. [25] predicts pose updates using contour information. Unlike these works, our approach considers the critical points and the loss surface of the robust objective function, and does not involve the fixed, predetermined iteration count used in recurrent-network-based approaches.
3 Approach
The input to HybridPose is an image $I$ containing an object in a known class, taken by a pinhole camera with known intrinsic parameters. Assuming that the class of objects has a canonical coordinate system (i.e., that of its 3D point cloud), HybridPose outputs the 6D pose of the object in the image as $(R, t) \in SE(3)$, where $R \in SO(3)$ is the rotation component and $t \in \mathbb{R}^3$ is the translation component.
3.1 Approach Overview
As illustrated in Figure 2, HybridPose consists of a prediction module and a pose regression module.
Prediction module (Section 3.2). HybridPose utilizes three prediction networks, $f_{\theta}$, $g_{\phi}$, and $h_{\psi}$, to estimate a set of keypoints $\{x_k\}$, a set of edges between keypoints $\{v_i\}$, and a set of symmetry correspondences between image pixels $\{(q_s, q'_s)\}$. Keypoints, edges, and symmetry correspondences are all represented in 2D. $\theta$, $\phi$, and $\psi$ are trainable parameters.
The keypoint network $f_{\theta}$ employs an off-the-shelf prediction network [33]. The other two prediction networks, $g_{\phi}$ and $h_{\psi}$, are introduced to stabilize pose regression when keypoint predictions are inaccurate. Specifically, $g_{\phi}$ predicts edge vectors along a predefined graph of keypoints, which stabilizes pose regression when keypoints are cluttered in the input image. $h_{\psi}$ predicts symmetry correspondences that reflect the underlying (partial) reflection symmetry. A key advantage of this symmetry representation is that the number of symmetry correspondences is large: every image pixel on the object has a symmetry correspondence. As a result, even with a large outlier ratio, symmetry correspondences still provide sufficient constraints for estimating the plane of reflection symmetry, which regularizes the underlying pose. Moreover, symmetry correspondences incorporate more features within the interior of the underlying object than keypoints and edge vectors.
Pose regression module (Section 3.3). The second module of HybridPose optimizes the object pose to fit the output of the three prediction networks. This module combines a trainable initialization submodule and a trainable refinement submodule. In particular, the initialization submodule performs SVD to solve for an initial pose in the global affine pose space. The refinement submodule utilizes robust norms to filter out outliers in the predicted elements for accurate object pose estimation.
Training HybridPose (Section 3.4). We train HybridPose by splitting the dataset into a training set and a validation set. We use the training set to learn the prediction module, and the validation set to learn the hyperparameters of the pose regression module. We tried training HybridPose end-to-end using one training set; however, the difference between the prediction distributions on the training set and the testing set leads to suboptimal generalization performance.
3.2 Hybrid Representation
This section describes three intermediate representations used in HybridPose.
Keypoints. The first intermediate representation consists of keypoints, which have been widely used for pose estimation. Given the input image $I$, we train a neural network $f_{\theta}$ to predict the 2D coordinates $\{x_k\}$ of a predefined set of keypoints. In our experiments, HybridPose uses an off-the-shelf model, PVNet [33], the state-of-the-art keypoint-based pose estimator, which employs a voting scheme to predict both visible and invisible keypoints.
Besides outliers in predicted keypoints, another limitation of keypoint-based techniques is that when the relative placement (direction and distance) of adjacent keypoints characterizes important information about the object pose, inexact keypoint predictions incur large pose errors.
Edges. The second intermediate representation, which consists of edge vectors along a predefined graph, explicitly models the displacement between every pair of keypoints. As illustrated in Figure 2, HybridPose utilizes a simple network $g_{\phi}$ to predict edge vectors $\{v_i\}$ in the 2D image plane, where $|\mathcal{E}|$ denotes the number of edges in the predefined graph $\mathcal{G}$. In our experiments, $\mathcal{G}$ is a fully-connected graph, i.e., $|\mathcal{E}| = |\mathcal{K}|(|\mathcal{K}|-1)/2$, where $|\mathcal{K}|$ is the number of keypoints.
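As an illustration, the edge set and edge vectors of a fully-connected keypoint graph can be generated from predicted 2D keypoints in a few lines. This is a minimal sketch with a hypothetical helper name, not the paper's code:

```python
import itertools

import numpy as np

def edge_vectors(keypoints):
    """Given an (n, 2) array of 2D keypoints, return the index pairs of the
    fully-connected keypoint graph and the corresponding edge (displacement)
    vectors. A fully-connected graph on n keypoints has n(n-1)/2 edges."""
    n = len(keypoints)
    pairs = list(itertools.combinations(range(n), 2))
    vectors = np.array([keypoints[b] - keypoints[a] for a, b in pairs])
    return pairs, vectors

# Three toy keypoints produce 3 * 2 / 2 = 3 edges.
kps = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
pairs, vecs = edge_vectors(kps)
```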
Symmetry correspondences. The third intermediate representation consists of predicted pixel-wise symmetry correspondences that reflect the underlying reflection symmetry. In our experiments, HybridPose extends the network architecture of FlowNet 2.0 [14], combining a dense pixel-wise flow with the semantic mask predicted by PVNet. The resulting symmetry correspondences are given by the predicted pixel-wise flow within the mask region. Compared to the first two representations, the number of symmetry correspondences is significantly larger, which provides rich constraints even for occluded objects. However, symmetry correspondences only constrain two degrees of freedom in the rotation component of the object pose (c.f. [23]). It is therefore necessary to combine symmetry correspondences with the other intermediate representations.
Summary of network design. In our experiments, $f_{\theta}$, $g_{\phi}$, and $h_{\psi}$ are all based on ResNet [10], and the implementation details are discussed in Section 4.1. Trainable parameters are shared across all but the last convolutional layer. Therefore, the overhead of introducing the edge prediction network and the symmetry prediction network is insignificant.
3.3 Pose Regression
The second module of HybridPose takes the predicted intermediate representations as input and outputs a 6D object pose for the input image $I$. Similar to state-of-the-art pose regression approaches [34], HybridPose combines an initialization submodule and a refinement submodule. Both submodules leverage all predicted elements. The refinement submodule additionally leverages a robust function to model outliers in the predicted elements.
In the following, we denote the 3D keypoint coordinates in the canonical coordinate system as $\{p_k\}$. To keep the notation uncluttered, we denote the output of the first module, i.e., the predicted keypoints, edge vectors, and symmetry correspondences, as $\{x_k\}$, $\{v_i\}$, and $\{(q_s, q'_s)\}$, respectively. Our formulation also uses the homogeneous coordinates $\bar{x}_k$, $\bar{v}_i$, and $(\bar{q}_s, \bar{q}'_s)$ of $x_k$, $v_i$, and $(q_s, q'_s)$, respectively. The homogeneous coordinates are normalized by the camera intrinsic matrix.
Initialization submodule. This submodule leverages constraints between $(R, t)$ and the predicted elements and solves for the pose in the affine space, which is then projected onto the rigid transformations in an alternating optimization manner. To this end, we introduce the following difference vectors for each type of predicted element:

(1)  $r^{K}_{k} := \bar{x}_k \times (R\, p_k + t), \quad 1 \le k \le |\mathcal{K}|$
(2)  $r^{E}_{i} := \bar{v}_i \times R\,(p_{b_i} - p_{a_i}), \quad 1 \le i \le |\mathcal{E}|$
(3)  $r^{S}_{s} := (\bar{q}_s \times \bar{q}'_s)^{\top} R\, n, \quad 1 \le s \le |\mathcal{S}|$

where $p_{a_i}$ and $p_{b_i}$ are the end vertices of edge $i$, and $n$ is the normal of the reflection symmetry plane in the canonical coordinate system.
HybridPose modifies the framework of EPnP [17] to generate the initial poses. By combining the three constraints from the predicted elements, we generate a linear system of the form $A \tilde{x} = 0$, where each predicted element contributes rows to the matrix $A$, and $\tilde{x}$ is a vector that contains the rotation and translation parameters in the affine space. To model the relative importance among keypoints, edge vectors, and symmetry correspondences, we rescale (2) and (3) by hyperparameters $\alpha_E$ and $\alpha_S$, respectively, when generating $A$.
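A homogeneous system of this form is solved in the least-squares sense by taking the right singular vector of the stacked matrix associated with the smallest singular value. A minimal NumPy sketch, on a toy matrix rather than the actual constraint matrix:

```python
import numpy as np

def null_space_solution(A):
    """Return the right singular vector of A with the smallest singular
    value, i.e. the minimizer of ||A x|| subject to ||x|| = 1. NumPy's
    svd returns singular values in descending order, so this is the last
    row of V^T."""
    _, _, vt = np.linalg.svd(A)
    return vt[-1]

# Toy system whose null space is spanned by (1, 1) / sqrt(2).
A = np.array([[1.0, -1.0], [2.0, -2.0], [3.0, -3.0]])
x = null_space_solution(A)
```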
Following EPnP [17], we compute $\tilde{x}$ as

(4)  $\tilde{x} = \sum_{j=1}^{J} \lambda_j u_j$

where $u_j$ is the right singular vector of $A$ associated with the $j$-th smallest singular value. Ideally, when the predicted elements are noise-free, $\tilde{x} = \lambda_1 u_1$ with $J = 1$ is an optimal solution. However, this strategy performs poorly given noisy predictions. Same as EPnP [17], we choose $J = 4$. To compute the optimal $\tilde{x}$, we optimize the latent variables $\{\lambda_j\}$ and the rotation matrix $R$ in an alternating optimization procedure with the following objective function:

(5)  $\min_{\{\lambda_j\},\, R \in SO(3)} \ \big\| \tilde{R} - R \big\|_{F}^{2}$

where $\tilde{R}$ is reshaped from the first nine elements of $\tilde{x}$. After obtaining the optimal $\tilde{x}$, we project the resulting affine transformation into a rigid transformation. Due to the space constraint, we defer the details to the supp. material.
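One standard construction for the affine-to-rigid projection step (the paper defers its exact variant to the supplementary material) is the SVD-based projection onto $SO(3)$, which yields the closest rotation in Frobenius norm:

```python
import numpy as np

def project_to_rotation(M):
    """Project an arbitrary 3x3 matrix onto SO(3) via SVD: replace the
    singular values by ones, and fix the sign if the product is an
    improper reflection (det = -1)."""
    u, _, vt = np.linalg.svd(M)
    R = u @ vt
    if np.linalg.det(R) < 0:
        u[:, -1] *= -1
        R = u @ vt
    return R

# A noisy, scaled matrix close to 2 * I projects to a valid rotation.
M = 2.0 * np.eye(3) + 0.1 * np.random.default_rng(0).standard_normal((3, 3))
R = project_to_rotation(M)
```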
Refinement submodule. Although (5) combines hybrid intermediate representations and admits good initialization, it does not directly model outliers in the predicted elements. Another limitation comes from (1) and (2), which do not minimize the projection errors (i.e., with respect to keypoints and edges) that are known to be effective in landmark-based pose estimation (c.f. [34]).
Benefiting from the initial object pose $(R^{(0)}, t^{(0)})$, the refinement submodule performs local optimization to refine the object pose. We introduce two difference vectors that involve projection errors:

(6)  $\tilde{r}^{K}_{k} := \pi(R\, p_k + t) - x_k, \quad 1 \le k \le |\mathcal{K}|$
(7)  $\tilde{r}^{E}_{i} := \pi(R\, p_{b_i} + t) - \pi(R\, p_{a_i} + t) - v_i, \quad 1 \le i \le |\mathcal{E}|$

where $\pi$ is the projection operator induced from the current pose $(R, t)$.
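A minimal sketch of such a reprojection residual under a normalized pinhole camera; `keypoint_residual` is an illustrative name, not the paper's code:

```python
import numpy as np

def keypoint_residual(R, t, p, x):
    """Reprojection residual pi(R p + t) - x, where pi divides by depth
    (the camera intrinsics are assumed to be already normalized away)."""
    q = R @ p + t
    return q[:2] / q[2] - x

# A point at (1, 0, 0) seen from 2 units away projects to (0.5, 0),
# so the residual against that prediction is zero.
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])
p = np.array([1.0, 0.0, 0.0])
res = keypoint_residual(R, t, p, np.array([0.5, 0.0]))
```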
To prune outliers in the predicted elements, we consider a generalized Geman-McClure (or GM) robust function

(8)  $\rho(x;\, \beta) := \dfrac{x^{2}}{\beta^{2} + x^{2}}$
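The plain Geman-McClure kernel illustrates the key property exploited here: it behaves like a scaled quadratic near zero and saturates at 1 for large residuals, so gross outliers contribute a bounded loss. This is the basic form for illustration; the generalized variant adds trainable hyperparameters:

```python
def geman_mcclure(x, beta=1.0):
    """Geman-McClure robust kernel rho(x) = x^2 / (beta^2 + x^2):
    approximately (x / beta)^2 for |x| << beta, approaching 1 as
    |x| -> infinity."""
    return x ** 2 / (beta ** 2 + x ** 2)

small = geman_mcclure(0.1)    # inlier-sized residual: tiny loss
large = geman_mcclure(100.0)  # gross outlier: loss saturates near 1
```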
With this setup, HybridPose solves the following non-linear optimization problem for pose refinement:

(9)  $\min_{R,\, t} \ \sum_{k=1}^{|\mathcal{K}|} \rho\big(\|\tilde{r}^{K}_{k}\|_{\Sigma^{k}};\, \beta_K\big) + \sum_{i=1}^{|\mathcal{E}|} \rho\big(\|\tilde{r}^{E}_{i}\|_{\Sigma^{i}};\, \beta_E\big) + \sum_{s=1}^{|\mathcal{S}|} \rho\big(|r^{S}_{s}|;\, \beta_S\big)$

where $\beta_K$, $\beta_E$, and $\beta_S$ are separate hyperparameters for keypoints, edges, and symmetry correspondences. $\Sigma^{k}$ and $\Sigma^{i}$ denote the covariance information attached to the keypoint and edge predictions, and $\|r\|_{\Sigma} := \sqrt{r^{\top} \Sigma^{-1} r}$. When covariances of the predictions are unavailable, we simply set $\Sigma = I$.
Starting from $R^{(0)}$ and $t^{(0)}$, the refinement submodule employs the Gauss-Newton method for numerical optimization.
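A generic Gauss-Newton loop is sketched below on a toy least-squares problem rather than the actual robustified pose objective: at each step, the residual is linearized and the normal equations $J\,dx = -r$ are solved in the least-squares sense.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, iters=10):
    """Generic Gauss-Newton iteration for min ||residual(x)||^2:
    repeatedly solve the linearized least-squares step J dx = -r."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r, J = residual(x), jacobian(x)
        dx = np.linalg.lstsq(J, -r, rcond=None)[0]
        x = x + dx
    return x

# Toy problem with residuals r(x) = [x - 2, 2x - 4]; the minimizer is x = 2.
res = lambda x: np.array([x[0] - 2.0, 2.0 * x[0] - 4.0])
jac = lambda x: np.array([[1.0], [2.0]])
x_opt = gauss_newton(res, jac, [10.0])
```

Because the toy residuals are linear, the iteration converges in a single step; on the pose objective the linearization is redone around each new pose estimate.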
In the supp. material, we provide a stability analysis of (9), and show how the optimal solution of (9) changes with respect to noise in the predicted representations. We also show the collaborative strength among all three intermediate representations: while keypoints contribute significantly to the accuracy of the translation $t$, edge vectors and symmetry correspondences can stabilize the regression of the rotation $R$.
3.4 HybridPose Training
This section describes how to train the prediction networks and hyperparameters of HybridPose using a labeled dataset $\mathcal{D}$, where each element consists of the RGB image, labeled keypoints, edges, symmetry correspondences, and ground-truth object pose. A popular strategy is to train the entire model end-to-end, e.g., using recurrent networks to model the optimization procedure and introducing loss terms on the object pose output as well as on the intermediate representations. However, we found this strategy suboptimal: the distribution of predicted elements on the training set differs from that on the testing set. Even by carefully tuning the trade-off between supervision on the predicted elements and on the final object pose, the pose regression model, which fits the training data, generalizes poorly to the testing data.
Our approach randomly divides the labeled set $\mathcal{D}$ into a training set $\mathcal{D}_{train}$ and a validation set $\mathcal{D}_{val}$. $\mathcal{D}_{train}$ is used to train the prediction networks, and $\mathcal{D}_{val}$ is used to train the hyperparameters of the pose regression model. Implementation and training details of the prediction networks are presented in Section 4.1. In the following, we focus on training the hyperparameters using $\mathcal{D}_{val}$.
Initialization submodule. Let $R^{init}_j$ and $t^{init}_j$ be the output of the initialization submodule on the $j$-th validation instance. We obtain the optimal hyperparameters $\alpha_E$ and $\alpha_S$ by solving the following optimization problem:

(10)  $\min_{\alpha_E,\, \alpha_S} \ \sum_{j \in \mathcal{D}_{val}} \big\| R^{init}_j - R^{gt}_j \big\|_{F}^{2} + \big\| t^{init}_j - t^{gt}_j \big\|^{2}$
Since the number of hyperparameters is rather small, and the pose initialization step does not admit an explicit expression, we use the finite-difference method to compute a numerical gradient, i.e., by fitting the gradient to samples of the hyperparameters around the current solution. We then apply backtracking line search for optimization.
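The two ingredients, central finite differences and backtracking (Armijo) line search, can be sketched as follows; a toy quadratic objective stands in for the actual initialization error, and all names are illustrative:

```python
import numpy as np

def fd_gradient(f, x, eps=1e-5):
    """Central finite-difference gradient, usable when the objective
    (like the pose initializer) has no explicit analytic expression."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def backtracking_step(f, x, g, step=1.0, shrink=0.5, c=1e-4):
    """One descent step along -g, shrinking the step size until the
    Armijo sufficient-decrease condition holds."""
    while f(x - step * g) > f(x) - c * step * g @ g:
        step *= shrink
    return x - step * g

# Minimize a toy quadratic with minimum at (3, -1).
f = lambda x: (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2
x = np.array([0.0, 0.0])
for _ in range(50):
    x = backtracking_step(f, x, fd_gradient(f, x))
```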
Refinement submodule. Let $\beta = (\beta_K, \beta_E, \beta_S)$ collect the hyperparameters of this submodule. For each instance $j$, denote the objective function in (9) as $f_j(c;\, \beta)$, where $c$ is a local parameterization of $(R, t)$ around $(R^{gt}_j, t^{gt}_j)$, i.e., $c = 0$ corresponds to the ground-truth pose.
The refinement module solves an unconstrained optimization problem, whose optimal solution is dictated by its critical points and the loss surface around the critical points. We consider two simple objectives. The first objective forces $\nabla_c f_j(0;\, \beta) \approx 0$, or in other words, the ground truth is approximately a critical point. The second objective minimizes the condition number $\kappa\big(\nabla_c^{2} f_j(0;\, \beta)\big)$. This objective regularizes the loss surface around each optimal solution, promoting a large convergence radius for $(R^{gt}_j, t^{gt}_j)$. With this setup, we formulate the following objective function to optimize $\beta$:

(11)  $\min_{\beta} \ \sum_{j \in \mathcal{D}_{val}} \big\| \nabla_c f_j(0;\, \beta) \big\|^{2} + \gamma\, \kappa\big( \nabla_c^{2} f_j(0;\, \beta) \big)$

where $\gamma$ is a fixed trade-off weight. The same strategy used in (10) is then applied to optimize (11).
4 Experimental Evaluation
This section presents an experimental evaluation of the proposed approach. Section 4.1 describes the experimental setup. Section 4.2 quantitatively and qualitatively compares HybridPose with other 6D pose estimation methods. Section 4.3 presents an ablation study to investigate the effectiveness of symmetry correspondences, edge vectors, and the refinement submodule.
4.1 Experimental Setup
Datasets. We consider two popular benchmark datasets that are widely used for the 6D pose estimation problem, Linemod [11] and Occlusion Linemod [3]. In comparison to Linemod, Occlusion Linemod contains more examples in which the objects are under occlusion. Our keypoint annotation strategy follows that of [33], i.e., we choose keypoints via the farthest point sampling algorithm. Edge vectors are defined as vectors connecting each pair of keypoints, so in total each object has $|\mathcal{K}|(|\mathcal{K}|-1)/2$ edges. We further use the algorithm proposed in [8] to annotate Linemod and Occlusion Linemod with reflection symmetry labels.
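Farthest point sampling is a simple greedy procedure: repeatedly pick the point farthest from the already selected set, which yields well-spread keypoints on the model. A sketch (illustrative, not the annotation code):

```python
import numpy as np

def farthest_point_sampling(points, k, seed_index=0):
    """Greedy farthest point sampling. Maintains, for every point, the
    distance to the nearest already-chosen point, and repeatedly adds
    the point maximizing that distance."""
    points = np.asarray(points, dtype=float)
    chosen = [seed_index]
    dist = np.linalg.norm(points - points[seed_index], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return chosen

# The two near-duplicate points at the origin are never both selected.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [0.0, 5.0]])
idx = farthest_point_sampling(pts, 3)
```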
Implementation details. We use ResNet [10] with weights pretrained on ImageNet [7] to build the prediction networks $f_{\theta}$, $g_{\phi}$, and $h_{\psi}$. The prediction networks take an RGB image of size $H \times W$ as input, and output a tensor of size $H \times W \times C$, where $H \times W$ is the image resolution and $C$ is the number of channels in the output tensor. The first channel in the output tensor is a binary segmentation mask $M$. If $M(p) = 1$, then $p$ corresponds to a pixel on the object of interest in the input image $I$. The segmentation mask is trained using the cross-entropy loss.
The next $2|\mathcal{K}|$ channels in the output tensor give the $x$ and $y$ components for all keypoints. A voting-based keypoint localization scheme [33] is applied to extract the coordinates of the 2D keypoints from this $2|\mathcal{K}|$-channel tensor and the segmentation mask $M$.
The next $2|\mathcal{E}|$ channels in the output tensor give the $x$ and $y$ components of all edges, which we denote as $E$. Let $i$ ($1 \le i \le |\mathcal{E}|$) be the index of an edge. Then $\{E_i(p) \mid M(p) = 1\}$ is a set of 2-tuples containing the pixel-wise predictions of the $i$-th edge vector, whose mean is extracted as the predicted edge.
The final 2 channels in the output tensor define the $x$ and $y$ components of the symmetry correspondences. We denote this 2-channel "map" of symmetry correspondences as $S$. Let $p$ be a pixel on the object of interest in the input image, i.e., $M(p) = 1$. Assuming $p' = p + S(p)$ and $M(p') = 1$, we consider $p$ and $p'$ to be symmetric with respect to the reflection symmetry plane.
We train all three intermediate representations using the smooth $\ell_1$ loss described in [9]. Network training employs the Adam [16] optimizer with a learning rate of 0.02 for 200 epochs. The training weights of the segmentation mask, keypoints, edge vectors, and symmetry correspondences are 1.0, 10.0, 0.1, and 0.1, respectively.
Evaluation protocols. We use two metrics to evaluate the performance of HybridPose:
1. ADD(S) [11, 41] first calculates the distance between two point sets transformed by the predicted pose and the ground-truth pose, respectively, and then extracts the mean distance. When the object possesses symmetric pose ambiguity, the mean distance is computed from the closest points between the two transformed sets. ADD(S) accuracy is defined as the percentage of examples whose mean distance is less than 10% of the model diameter.
2. In the ablation study, we compute and report the angular rotation error and the relative translation error $\|t^{pred} - t^{gt}\| / d$ between the predicted pose $(R^{pred}, t^{pred})$ and the ground-truth pose $(R^{gt}, t^{gt})$, where $d$ is the object diameter.
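The (non-symmetric) ADD computation can be sketched in a few lines; `add_metric` is a hypothetical helper, not the official evaluation script:

```python
import numpy as np

def add_metric(R_pred, t_pred, R_gt, t_gt, model_points):
    """ADD: mean distance between the model points transformed by the
    predicted pose and by the ground-truth pose. (The symmetric ADD-S
    variant would instead match each point to its closest counterpart.)"""
    a = model_points @ R_pred.T + t_pred
    b = model_points @ R_gt.T + t_gt
    return np.linalg.norm(a - b, axis=1).mean()

# Identical rotations with a 0.1 translation offset give ADD = 0.1 exactly.
pts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
err = add_metric(R, np.array([0.0, 0.0, 0.1]), R, np.zeros(3), pts)
```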
4.2 Analysis of Results
object  Tekin  BB8  Pix2Pose  PVNet  CDPN  DPOD  Ours 

ape  21.6  40.4  58.1  43.6  64.4  87.7  77.6 
benchvise  81.8  91.8  91.0  99.9  97.8  98.5  99.6 
cam  36.6  55.7  60.9  86.9  91.7  96.1  95.9 
can  68.8  64.1  84.4  95.5  95.9  99.7  93.6 
cat  41.8  62.6  65.0  79.3  83.8  94.7  93.5 
driller  63.5  74.4  76.3  96.4  96.2  98.8  97.2 
duck  27.2  44.3  43.8  52.6  66.8  86.3  87.0 
eggbox  69.6  57.8  96.8  99.2  99.7  99.9  99.6 
glue  80.0  41.2  79.4  95.7  99.6  96.8  98.7 
holepuncher  42.6  67.2  74.8  81.9  85.8  86.9  92.5 
iron  75.0  84.7  83.4  98.9  97.9  100.0  98.1 
lamp  71.1  76.5  82.0  99.3  97.9  96.8  96.9 
phone  47.7  54.0  45.0  92.4  90.8  94.7  98.3 
average  56.0  62.7  72.4  86.3  89.9  95.2  94.5 
object  PoseCNN  Oberweger  Hu  Pix2Pose  PVNet  DPOD  Ours 

ape  9.6  12.1  17.6  22.0  15.8    53.3 
can  45.2  39.9  53.9  44.7  63.3    86.5 
cat  0.93  8.2  3.3  22.7  16.7    73.4 
driller  41.4  45.2  62.4  44.7  65.7    92.8 
duck  19.6  17.2  19.2  15.0  25.2    62.8 
eggbox  22.0  22.1  25.9  25.2  50.2    95.3 
glue  38.5  35.8  39.6  32.4  49.6    92.5 
holepuncher  22.1  36.0  21.3  49.5  39.7    76.7 
average  24.9  27.0  27.0  32.0  40.8  47.3  79.2 
As shown in Table 1, Table 2, and Figure 3, HybridPose leads to accurate pose estimation. On Linemod and Occlusion Linemod, HybridPose achieves average ADD(S) accuracies of 94.5% and 79.2%, respectively. The result on Linemod outperforms all but one state-of-the-art approach that regresses poses from intermediate representations. The result on Occlusion Linemod outperforms all state-of-the-art approaches.
Baseline comparison on Linemod. HybridPose outperforms PVNet [33], the backbone model we use to predict keypoints. The improvement is consistent across all object classes, which demonstrates the clear advantage of using a hybrid rather than a unitary intermediate representation. HybridPose shows competitive results against DPOD [43], winning on five object classes. The advantage of DPOD on this particular dataset comes from data augmentation and explicit modeling of dense correspondences between the input and projected images, both of which cater to situations without object occlusion. A detailed analysis reveals that the object classes on which HybridPose exhibits suboptimal performance are the smallest objects in Linemod. This suggests that the pixel-based descriptors used in our pipeline are limited by image resolution.
keypoints  keypoints + symmetries  full model  

Rotation  Translation  Rotation  Translation  Rotation  Translation  
ape  1.914°  0.107  1.809°  0.113  1.543°  0.092 
can  1.472°  0.059  1.710°  0.073  0.912°  0.041 
cat  1.039°  0.119  0.888°  0.117  0.751°  0.055 
driller  1.180°  0.057  1.180°  0.057  0.803°  0.027 
duck  1.773°  0.116  1.679°  0.115  1.439°  0.068 
eggbox  1.675°  0.107  1.587°  0.105  1.052°  0.052 
glue  1.796°  0.097  1.681°  0.099  1.224°  0.066 
holepuncher  2.319°  0.141  2.192°  0.140  1.704°  0.051 
mean  1.648°  0.100  1.590°  0.102  1.179°  0.057 
Baseline comparison on Occlusion Linemod. HybridPose outperforms all baseline approaches by a considerable margin. In terms of ADD(S) accuracy, our approach improves PVNet [33] from 40.8% to 79.2%, a 94.1% relative enhancement. This enhancement clearly shows the advantage of HybridPose on occluded objects, where predictions of invisible keypoints can be noisy, and visible keypoints alone may not provide sufficient constraints for pose regression. HybridPose also outperforms DPOD, the current state-of-the-art pose estimator on Occlusion Linemod, by 67.4%. One explanation is that rendering-based approaches such as DPOD work less well on occluded objects, due to the difficulty of modeling occlusions in data augmentation and correspondence computation.
Running time. On a desktop with a 16-core Intel(R) Xeon(R) E5-2637 CPU and a GeForce GTX 1080 GPU, HybridPose takes 0.6 seconds to predict the intermediate representations and 0.4 seconds to regress the pose for a batch of images. With a batch size of 30, this amounts to an average processing speed of 30 frames per second, enabling real-time analysis.
4.3 Ablation Study
We proceed to provide an ablation study. Table 3 summarizes the performance of HybridPose using different predicted intermediate representations. Since the performance of different methods is close to saturation on Linemod, the ablation study presented here is based on Occlusion Linemod, which more clearly reveals the effect of different predicted elements on pose refinement. The ablation study on Linemod is deferred to the supp. material.
With keypoints. As a baseline approach, we first estimate object poses by only utilizing keypoint information. As shown in Table 3, this gives a mean absolute rotation error of 1.648 degrees, and a mean relative translation error of 0.100.
With keypoints and symmetry. Adding symmetry correspondences to keypoints leads to noticeable performance gains in the rotation component. The relative performance gain is 3.52%, and the improvement is almost consistent across all object categories, which clearly demonstrates the effectiveness of symmetry correspondences. On the other hand, the translation error of using keypoints alone and of using keypoints + symmetry remains almost the same. One explanation is that symmetry correspondences only constrain two of the three rotational degrees of freedom, while providing no constraints on the translation parameters.
Full model. Adding edge vectors to keypoints and symmetry correspondences leads to salient performance gains in both rotation and translation estimation. The relative performance gains in rotation and translation are 25.85% and 44.12%, respectively. One explanation is that edge vectors provide more constraints on both translation and rotation. Compared with keypoints, edge vectors provide more constraints on translation, as they represent displacements between adjacent keypoints and provide gradient information for regression; as a result, the translation error is significantly reduced. Compared with symmetry correspondences, which only provide two constraints on rotation, edge vectors constrain all three rotational degrees of freedom, which boosts the performance of rotation estimation. Additionally, the improved rotation estimation helps the GM robust function in the refinement submodule identify outliers in keypoint prediction.
5 Conclusions and Future Work
In this paper, we introduce HybridPose, a 6D pose estimation approach that utilizes keypoints, edge vectors, and symmetry correspondences. Experiments show that HybridPose enjoys realtime prediction and outperforms current stateoftheart pose estimation approaches in accuracy. HybridPose is robust to occlusion and extreme poses. In the future, we plan to extend HybridPose to include more intermediate representations such as shape primitives, normals, and planar faces. Another possible direction of future work is to enforce consistency across different representations as a selfsupervision loss in network training.
References
 [1] (2016) Reflection symmetry detection via appearance of structure descriptor. In Computer Vision  ECCV 2016  14th European Conference, Amsterdam, The Netherlands, October 1114, 2016, Proceedings, Part III, pp. 3–18. External Links: Link, Document Cited by: §2.

[2]
(2015)
Deepedge: a multiscale bifurcated deep network for topdown contour detection.
In
Proceedings of the IEEE conference on computer vision and pattern recognition
, pp. 4380–4389. Cited by: §2.  [3] (2014) Learning 6d object pose estimation using 3d object coordinates. In European conference on computer vision, pp. 536–551. Cited by: HybridPose: 6D Object Pose Estimation under Hybrid Representations, §1, §2, §4.1.
 [4] (2016) Uncertaintydriven 6d pose estimation of objects and scenes from a single rgb image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3364–3372. Cited by: §2.
 [5] (2017) Realtime multiperson 2d pose estimation using part affinity fields. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 2126, 2017, pp. 1302–1310. External Links: Link, Document Cited by: §2.
 [6] (2018) Pose estimation for objects with rotational symmetry. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7215–7222. Cited by: §2.
[7] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §4.1.
 [8] (2018) Seeing behind the scene: using symmetry to reason about objects in cluttered environments. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7193–7200. Cited by: §4.1.
[9] (2015) Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV). Cited by: §4.1.
 [10] (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385. Cited by: §3.2, §4.1.
[11] (2013) Model-based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Proceedings of the 11th Asian Conference on Computer Vision - Volume Part I, ACCV'12, Berlin, Heidelberg, pp. 548–562. Cited by: §1, §4.1, §4.1.
[12] (2004) On symmetry and multiple-view geometry: structure, pose, and calibration from a single image. International Journal of Computer Vision 60 (3), pp. 241–265. Cited by: §2.
[13] (2019) Segmentation-driven 6d object pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3385–3394. Cited by: §2, Table 2.
[14] (2017) FlowNet 2.0: evolution of optical flow estimation with deep networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, pp. 1647–1655. Cited by: §3.2.
[15] (2015) PoseNet: a convolutional network for real-time 6-DoF camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2938–2946. Cited by: §1.
[16] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
[17] (2009) EPnP: an accurate O(n) solution to the PnP problem. International Journal of Computer Vision 81 (2), pp. 155. Cited by: §2, §3.3, §3.3.
[18] (2018) DeepIM: deep iterative matching for 6d pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 683–698. Cited by: §2.
[19] (2019) CDPN: coordinates-based disentangled pose network for real-time RGB-based 6-DoF object pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7678–7687. Cited by: §2, Table 1.
[20] (2016) Symmetry-aware depth estimation using deep neural networks. arXiv preprint arXiv:1604.06079. Cited by: §2.
[21] (2010) Computational symmetry in computer vision and computer graphics. Foundations and Trends in Computer Graphics and Vision 5 (1–2), pp. 1–195. Cited by: §2.
 [22] (2016) Learning relaxed deep supervision for better edge detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 231–240. Cited by: §2.
[23] (2003) An invitation to 3d vision: from images to geometric models. Springer-Verlag. Cited by: §3.2.
 [24] (2019) Explaining the ambiguity of object detection and 6d pose from visual data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6841–6850. Cited by: §2.
[25] (2018) Deep model-based 6d pose refinement in RGB. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 800–815. Cited by: §2.
[26] (2013) Symmetry in 3d geometry: extraction and applications. Computer Graphics Forum 32 (6), pp. 1–23. Cited by: §2.
[27] (2016) Stacked hourglass networks for human pose estimation. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VIII, pp. 483–499. Cited by: §2.
[28] (2018) Making deep heatmaps robust to partial occlusions for 3d object pose estimation. Lecture Notes in Computer Science, pp. 125–141. Cited by: Table 2.
[29] (2019) Pix2Pose: pixel-wise coordinate regression of objects for 6d pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7668–7677. Cited by: §2, §2, Table 1.
[30] (2011) Using facial symmetry to handle pose variations in real-world 3d face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (10), pp. 1938–1951. Cited by: §2.
[31] (2017) 6-DoF object pose from semantic keypoints. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2011–2018. Cited by: §2, §2.
[32] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §5.
[33] (2018) PVNet: pixel-wise voting network for 6-DoF pose estimation. CoRR abs/1812.11788. Cited by: §1, §2, §2, §3.1, §3.2, §4.1, §4.1, §4.2, §4.2, Table 1, Table 2.
[34] (2018) Lambda twist: an accurate fast robust perspective three point (P3P) solver. In The European Conference on Computer Vision (ECCV). Cited by: §3.3, §3.3.
 [35] (2017) BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3828–3836. Cited by: §2, §2, Table 1.
 [36] (2018) Implicit 3d orientation learning for 6d object detection from rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 699–715. Cited by: §2.
[37] (2018) Real-time seamless single shot 6d object pose prediction. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, pp. 292–301. Cited by: §1, Table 1.
[38] (2015) Viewpoints and keypoints. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7–12, 2015, pp. 1510–1519. Cited by: §1.
[39] (2019) DenseFusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3343–3352. Cited by: §2, §2.
[40] (2015) Reflection symmetry detection using locally affine invariant edge correspondence. IEEE Transactions on Image Processing 24 (4), pp. 1297–1301. Cited by: §2.
[41] (2018) PoseCNN: a convolutional neural network for 6d object pose estimation in cluttered scenes. In Robotics: Science and Systems XIV, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA, June 26–30, 2018. Cited by: §1, §2, §2, §4.1, Table 2.
[42] (2011) Symmetric piecewise planar object reconstruction from a single image. In CVPR 2011, pp. 2577–2584. Cited by: §2.
[43] (2019) DPOD: 6d pose object detector and refiner. arXiv preprint arXiv:1902.11020. Cited by: §1, §2, §4.2, Table 1, Table 2.
[44] (2019) PPGNet: learning point-pair graph for line segment detection. arXiv preprint arXiv:1905.03415. Cited by: §2.
[45] (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858. Cited by: §2, §2.
[46] (2018) StarMap for category-agnostic keypoint and viewpoint estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 318–334. Cited by: §2.