Pose Estimation, Dense Prediction, Deep Learning
Given images of known, rigid objects, 6D object pose estimation describes the problem of determining the identity of the objects, their position and their orientation. Recent research focuses on increasingly difficult datasets with multiple objects per image, cluttered environments, and partially occluded objects. Symmetric objects pose a particular challenge for orientation estimation, because multiple solutions or manifolds of solutions exist. While the pose problem mainly receives attention from the computer vision community, in recent years there have been multiple robotics competitions involving 6D pose estimation as a key component, for example the Amazon Picking Challenge of 2015 and 2016 and the Amazon Robotics Challenge of 2017, where robots had to pick objects from highly cluttered bins. The pose estimation problem is also highly relevant in human-designed, less structured environments, e.g. as encountered in the RoboCup@Home competition[iocchi2015robocup], where robots have to operate within home environments.
2 Related Work
For a long time, feature-based and template-based methods were popular for 6D object pose estimation lowe2004distinctive, wagner2008pose, hinterstoisser2012gradient, hinterstoisser2012model. However, feature-based methods rely on distinguishable features and perform badly for texture-poor objects. Template-based methods do not work well if objects are partially occluded. With deep learning methods showing success for different image-related problem settings, models inspired by or extending these have been used increasingly. Many methods use established architectures to solve sub-problems, for example semantic segmentation or instance segmentation. Apart from that, most recent methods use deep learning for their complete pipeline. We divide these methods into two groups: direct pose regression methods deep6dpose, xiang2017posecnn and methods that predict 2D-3D object correspondences and then solve the PnP problem to recover the 6D pose. The latter can be further divided into methods that predict dense, pixel-wise correspondences brachmann2014learning, brachmann2016uncertainty, krull2015learning and, more recently, methods that estimate the 2D coordinates of selected keypoints, usually the 3D object bounding box corners heatmaps, BB8, SSS6D, dope.
heatmaps predict the projection of the 3D bounding box as a heat map. To achieve robustness to occlusion, they predict the heat map independently for small object patches before adding them together. The maximum is selected as the corner position. If patches are ambiguous, the training technique implicitly results in an ambiguous heat map prediction. This method also uses Feature Mapping featuremapping, a technique to bridge the domain-gap between synthetic and real training data.
We note that newer approaches increasingly focus on the monocular pose estimation problem without depth information (brachmann2016uncertainty, deep6dpose, jafari2017ipose, BB8, heatmaps, SSS6D, dope, xiang2017posecnn). In addition to predicting the pose from RGB or RGB-D data, there are several refinement techniques for pose improvement after the initial estimation. li2018deepim introduce a render-and-compare technique that improves the estimation only using the original RGB input. If depth is available, ICP registration can be used to refine poses.
As a representative of the direct regression method, we discuss PoseCNN xiang2017posecnn in more detail. It delivered state-of-the-art performance on the occluded LINEMOD dataset and introduced a more challenging dataset, the YCB-Video Dataset. PoseCNN decouples the problem of pose estimation into estimating the translation and orientation separately. A pretrained VGG16 backbone is used for feature extraction. The features are processed in three different branches: Two fully convolutional branches estimate a semantic segmentation, center directions, and the depth for every pixel of the image. The third branch consists of a RoI pooling and a fully-connected architecture which regresses to a quaternion describing the rotation for each region of interest.
RoI pooling, i.e. cutting out and size-normalizing an object hypothesis, was originally developed for the object detection problem girshick2015fast, where it is used to extract an object-centered and size-normalized view of the extracted CNN features. The following classification network, usually consisting of a few convolutional and fully-connected layers, then directly computes class scores for the extracted region. As RoI pooling focuses on individual object hypotheses, it loses contextual information, which might be important in cluttered scenes where objects are densely packed and occlude each other. RoI pooling requires random access to the source feature map for cutting out and interpolating features. Such random access patterns are expensive to implement in hardware circuits and have no equivalent in the visual cortex kandel2000principles. Additionally, RoI pooling is often followed by fully-connected layers, which drive up parameter count and inference/training time.
Following the initial breakthroughs using RoI pooling, simpler architectures for object detection have been proposed which compute the class scores in a fully convolutional way redmon2016you. An important insight here is that a CNN is essentially equivalent to a sliding-window operator, i.e. fully-convolutional classification is equivalent to RoI-pooled classification with a fixed region size. While the in-built size-invariance of RoI pooling is lost, fully-convolutional architectures typically outperform RoI-based ones in terms of model size and training/inference speed. With a suitably chosen loss function that addresses the inherent example imbalances during training lin2017focal, fully-convolutional architectures reach state-of-the-art accuracy in object detection.
Following this idea, we developed a fully-convolutional architecture evolved from PoseCNN, that replaces the RoI pooling-based orientation estimation of PoseCNN with a fully-convolutional, pixel-wise quaternion orientation prediction (see Fig. 1). Recently, peng2019pvnet also removed the RoI-pooled orientation prediction branch, but with a different method: Here, 2D directions to a fixed number of keypoints are densely predicted. Each keypoint is found using a separate Hough transform and the pose is then estimated using a PnP solver utilizing the known keypoint correspondences. In contrast, our method retains the direct orientation regression branch, which may be interesting in resource-constrained scenarios, where the added overhead of additional Hough transforms and PnP solving is undesirable.
Our proposed changes unify the architecture and make it more parallel: PoseCNN first predicts the translation and the regions of interest (RoI) and then, sequentially for each RoI, estimates the object orientation. Our architecture can perform the rotation estimation for multiple objects in parallel, independently of the translation estimation. We investigated different averaging and clustering schemes for obtaining a final orientation from our pixel-wise estimation. We compare the results of our architecture to PoseCNN on the YCB-Video Dataset xiang2017posecnn. We show that our fully-convolutional architecture with pixel-wise prediction achieves precise results while using far fewer parameters. The simpler architecture also results in shorter training times.
In summary, our contributions include:
A conceptually simple, small, and fast-to-train architecture for dense orientation estimation, whose prediction is easily interpretable due to its dense nature,
a comparison of different orientation aggregation techniques, and
a thorough evaluation and ablation study of the different design choices on the challenging YCB-Video dataset.
We propose an architecture derived from PoseCNN xiang2017posecnn, which predicts, starting from RGB images, 6D poses for each object in the image. The network starts with the convolutional backbone of VGG16 simonyan2014very that extracts features. These are subsequently processed in three branches: The fully-convolutional segmentation branch that predicts a pixel-wise semantic segmentation, the fully-convolutional vertex branch, which predicts a pixel-wise estimation of the center direction and center depth, and the quaternion estimation branch. The segmentation and vertex branch results are combined to vote for object centers in a Hough transform layer. The Hough layer also predicts bounding boxes for the detected objects. PoseCNN then uses these bounding boxes to crop and pool the extracted features which are then fed into a fully-connected neural network architecture. This fully-connected part predicts an orientation quaternion for each bounding box.
Our architecture, shown in Fig. 2, replaces the quaternion estimation branch of PoseCNN with a fully-convolutional architecture, similar to the segmentation and vertex prediction branch. It predicts quaternions pixel-wise. We call it ConvPoseCNN (short for convolutional PoseCNN). Similarly to PoseCNN, quaternions are regressed directly using a linear output layer. The added layers have the same architectural parameters as in the segmentation branch (filter size 3×3) and are thus quite light-weight.
While densely predicting orientations at the pixel level might seem counter-intuitive, since orientation estimation typically needs long-range information from distant pixels, we argue that due to the total depth of the convolutional network and the involved pooling operations the receptive field for a single output pixel covers large parts of the image and thus allows long-range information to be considered during orientation prediction.
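The receptive-field argument can be checked with a little arithmetic. The sketch below assumes the standard VGG16 layout (3×3 convolutions with stride 1, 2×2 max pooling with stride 2) up to the last convolution of the fifth block. The helper `receptive_field` and the layer list are illustrative and not taken from the published code; the additional branch layers of ConvPoseCNN are not modelled and would only enlarge the result.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) of a stack of layers.

    Each layer is a (kernel_size, stride) pair, applied in order.
    """
    rf = 1    # receptive field of one output pixel so far
    jump = 1  # distance in input pixels between neighbouring outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


CONV = (3, 1)  # 3x3 convolution, stride 1
POOL = (2, 2)  # 2x2 max pooling, stride 2

# Assumed standard VGG16 convolutional backbone up to conv5_3.
VGG16_BACKBONE = (
    [CONV] * 2 + [POOL]
    + [CONV] * 2 + [POOL]
    + [CONV] * 3 + [POOL]
    + [CONV] * 3 + [POOL]
    + [CONV] * 3
)

vgg16_rf = receptive_field(VGG16_BACKBONE)
print(vgg16_rf)  # 196
```

Under these assumptions a single backbone output pixel already depends on a 196 × 196 pixel window of the input, before any of the branch-specific layers are applied.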
3.1 Aggregation of Dense Orientation Predictions
We estimate quaternions pixel-wise and use the predicted segmentation to identify which quaternions belong to which object. If multiple instances of one object can occur, one could use the Hough inliers instead of the segmentation. Before the aggregation of the selected quaternions $q_1, \dots, q_n$ to a final orientation estimate $\bar{q}$, we ensure that each predicted quaternion $q_i$ corresponds to a rotation by scaling it to unit norm. However, we found that the norm $w_i = \|q_i\|$ prior to scaling is of interest for aggregation: In feature-rich regions, where there is more evidence for the orientation prediction, it tends to be higher (see Section 4.8). We investigated averaging and clustering techniques for aggregation, optionally weighted by $w_i$.
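As a minimal sketch of this selection step, the following groups the dense predictions per object class, scales them to unit norm, and keeps the pre-normalization norm as a candidate weight. The function name `group_quaternions`, the nested-list image layout, the `(w, x, y, z)` component order, and the background label `0` are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def group_quaternions(quat_map, label_map, background=0):
    """Collect unit quaternions and their pre-normalization norms per class.

    quat_map:  H x W nested list of raw 4-vectors predicted by the network
    label_map: H x W nested list of predicted class ids
    Returns {class_id: [(unit_quaternion, norm), ...]}.
    """
    groups = defaultdict(list)
    for quat_row, label_row in zip(quat_map, label_map):
        for raw, label in zip(quat_row, label_row):
            if label == background:
                continue
            norm = math.sqrt(sum(c * c for c in raw))
            if norm == 0.0:
                continue  # a zero vector carries no orientation evidence
            unit = tuple(c / norm for c in raw)
            groups[label].append((unit, norm))
    return dict(groups)


# Tiny 2 x 2 example with two object classes and one background pixel.
quat_map = [
    [(2.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
    [(0.0, 0.5, 0.0, 0.0), (0.0, 3.0, 4.0, 0.0)],
]
label_map = [
    [1, 1],
    [0, 2],
]
groups = group_quaternions(quat_map, label_map)
```

Each entry of `groups` then holds exactly the inputs needed by an aggregation scheme: a list of unit quaternions and one scalar weight per quaternion.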
For averaging the predictions we use the weighted quaternion average as defined by markley2007quaternion. Here, the average $\bar{q}$ of quaternion samples $q_1, \dots, q_n$ with weights $w_1, \dots, w_n$ is defined using the corresponding rotation matrices $R(q_i)$:

$$\bar{q} = \arg\min_{q \in \mathbb{S}^3} \sum_{i=1}^{n} w_i \, \| R(q) - R(q_i) \|_F^2 ,$$

where $\mathbb{S}^3$ is the unit 3-sphere and $\| \cdot \|_F$ is the Frobenius norm. This definition avoids any problems arising from the antipodal symmetry of the quaternion representation. The exact solution to the optimization problem can be found by solving an eigenvalue problem markley2007quaternion.
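The eigenvalue formulation can be sketched in a few lines. Following markley2007quaternion, the average is the eigenvector belonging to the largest eigenvalue of the 4×4 matrix $\sum_i w_i q_i q_i^T$. The sketch below approximates that eigenvector with plain power iteration to stay dependency-free; this is a choice made for the example, not necessarily what the authors' implementation does. The names `markley_average` and `rot_x` are hypothetical.

```python
import math


def markley_average(quats, weights=None, iterations=100):
    """Weighted quaternion average via the dominant eigenvector of sum w*q*q^T.

    Assumes unit quaternions and at least one strictly positive weight.
    """
    if weights is None:
        weights = [1.0] * len(quats)
    # Accumulate M = sum_i w_i * q_i q_i^T; q_i and -q_i contribute equally.
    M = [[0.0] * 4 for _ in range(4)]
    for q, w in zip(quats, weights):
        for r in range(4):
            for c in range(4):
                M[r][c] += w * q[r] * q[c]
    # Power iteration, started from the sample with the largest weight.
    start = max(range(len(quats)), key=lambda i: weights[i])
    v = list(quats[start])
    for _ in range(iterations):
        v = [sum(M[r][c] * v[c] for c in range(4)) for r in range(4)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return tuple(v)


def rot_x(angle):
    """Unit quaternion (w, x, y, z) for a rotation about the x axis."""
    return (math.cos(angle / 2), math.sin(angle / 2), 0.0, 0.0)


# Averaging rotations of 0.2 rad and 0.4 rad should give roughly 0.3 rad.
avg = markley_average([rot_x(0.2), rot_x(0.4)])
# Flipping the sign of one sample must not change the represented rotation.
avg_flipped = markley_average([rot_x(0.2), tuple(-c for c in rot_x(0.4))])
```

Because the matrix is built from outer products, the result is independent of the sign chosen for each input quaternion, which is the property the text refers to as avoiding the antipodal symmetry problem.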
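For the alternative clustering aggregation, we follow a weighted RANSAC scheme: For quaternions $Q = \{q_1, \dots, q_n\}$ and their weights $w_1, \dots, w_n$ associated with one object, this algorithm repeatedly chooses a random quaternion $\bar{q} \in Q$ with a probability proportional to its weight and then determines the inlier set $\bar{Q} = \{ q \in Q \mid d(q, \bar{q}) < t \}$, where $d(\cdot, \cdot)$ is the angular distance and $t$ is the inlier threshold. Finally, the $\bar{q}$ with the largest $\sum_{q_i \in \bar{Q}} w_i$ is selected as the result quaternion.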
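The weighted RANSAC scheme can be sketched as follows. The angular distance is taken here as $2 \arccos(|q_1 \cdot q_2|)$ for unit quaternions, and the default of 50 iterations mirrors the setting reported in Section 4.6. The function names, the fixed random seed, and the toy data are assumptions for illustration.

```python
import math
import random


def angular_distance(q1, q2):
    """Rotation angle in radians between two unit quaternions (sign-invariant)."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))


def weighted_ransac(quats, weights, threshold, iterations=50, rng=None):
    """Return the sampled quaternion with the largest summed inlier weight."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    best_q, best_score = None, -1.0
    for _ in range(iterations):
        # Draw a hypothesis with probability proportional to its weight.
        candidate = rng.choices(quats, weights=weights, k=1)[0]
        # Sum the weights of all samples within the angular threshold.
        score = sum(
            w for q, w in zip(quats, weights)
            if angular_distance(q, candidate) < threshold
        )
        if score > best_score:
            best_q, best_score = candidate, score
    return best_q, best_score


def rot_z(angle):
    """Unit quaternion (w, x, y, z) for a rotation about the z axis."""
    return (math.cos(angle / 2), 0.0, 0.0, math.sin(angle / 2))


# Three consistent predictions and one outlier, all with unit weight.
quats = [rot_z(0.0), rot_z(0.02), rot_z(0.04), rot_z(2.0)]
weights = [1.0, 1.0, 1.0, 1.0]
best_q, best_score = weighted_ransac(quats, weights, threshold=0.1)
```

Unlike averaging, the result is always one of the input predictions, so a hypothesis is never blended with a distant but shape-equivalent one.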
The possibility of weighting the individual samples is highly useful in this context, since we expect that parts of the object are more important for determining the correct orientation than others (e.g. the handle of a cup). In our architecture, sources of such pixel-wise weight information can be the segmentation branch with the class confidence scores, as well as the predicted quaternion norms before normalization.
3.2 Losses and Training
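For training the orientation branch, xiang2017posecnn propose the ShapeMatch loss (SMLoss). This loss calculates a distance measure between point clouds of the object rotated by quaternions $\tilde{q}$ and $q$:

$$\mathrm{SMLoss}(\tilde{q}, q) = \begin{cases} \mathrm{SLoss}(\tilde{q}, q) & \text{if the object is symmetric,} \\ \mathrm{PLoss}(\tilde{q}, q) & \text{otherwise.} \end{cases}$$

Given a set of 3D points $\mathcal{M}$, where $m = |\mathcal{M}|$ and $R(q)$ and $R(\tilde{q})$ are the rotation matrices corresponding to the ground-truth and estimated quaternion, respectively, PLoss and SLoss are defined in xiang2017posecnn as follows:

$$\mathrm{PLoss}(\tilde{q}, q) = \frac{1}{2m} \sum_{x \in \mathcal{M}} \| R(\tilde{q}) x - R(q) x \|^2 ,$$

$$\mathrm{SLoss}(\tilde{q}, q) = \frac{1}{2m} \sum_{x_1 \in \mathcal{M}} \min_{x_2 \in \mathcal{M}} \| R(\tilde{q}) x_1 - R(q) x_2 \|^2 .$$

Similar to the ICP objective, SLoss does not penalize rotations of symmetric objects that lead to equivalent shapes.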
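To make the difference between the two point-cloud losses concrete, here is a small pure-Python sketch. It assumes the definitions from xiang2017posecnn as we understand them: PLoss averages squared distances between corresponding rotated model points, SLoss averages squared distances to the closest rotated model point, both scaled by 1/(2m). Quaternions are taken in `(w, x, y, z)` order, and all function names are hypothetical.

```python
def quat_to_matrix(q):
    """Rotation matrix of a unit quaternion given as (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def rotate(R, p):
    """Apply a 3x3 rotation matrix to a 3D point."""
    return tuple(sum(R[r][c] * p[c] for c in range(3)) for r in range(3))


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def ploss(q_est, q_gt, points):
    """Mean squared distance between corresponding points, scaled by 1/2."""
    R_est, R_gt = quat_to_matrix(q_est), quat_to_matrix(q_gt)
    total = sum(sq_dist(rotate(R_est, x), rotate(R_gt, x)) for x in points)
    return total / (2 * len(points))


def sloss(q_est, q_gt, points):
    """Mean squared distance to the closest ground-truth point, scaled by 1/2."""
    R_est, R_gt = quat_to_matrix(q_est), quat_to_matrix(q_gt)
    gt_points = [rotate(R_gt, x) for x in points]
    total = sum(
        min(sq_dist(rotate(R_est, x), g) for g in gt_points) for x in points
    )
    return total / (2 * len(points))


def smloss(q_est, q_gt, points, symmetric):
    """ShapeMatch loss: SLoss for symmetric objects, PLoss otherwise."""
    return sloss(q_est, q_gt, points) if symmetric else ploss(q_est, q_gt, points)


# A two-point "rod" along x is unchanged by a 180 degree rotation about z.
rod = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
identity = (1.0, 0.0, 0.0, 0.0)
half_turn_z = (0.0, 0.0, 0.0, 1.0)
print(ploss(half_turn_z, identity, rod))  # 2.0
print(sloss(half_turn_z, identity, rod))  # 0.0
```

The nested minimum makes SLoss quadratic in the number of model points. With the 2620 points per object model mentioned in Section 4, one evaluation in this naive form already needs about 6.9 million point-pair distances.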
In our case, ConvPoseCNN outputs a dense, pixel-wise orientation prediction. Computing the SMLoss pixel-wise is computationally prohibitive. First aggregating the dense predictions and then calculating the orientation loss makes it possible to train with SMLoss. In this setting, we use a naive average, the normalized sum of all quaternions, to facilitate backpropagation through the aggregation step. As a more efficient alternative, we experiment with pixel-wise L2 or Qloss billings2018silhonet loss functions, which are evaluated for the pixels indicated by the ground-truth segmentation. Qloss is designed to handle the quaternion symmetry. For two quaternions $\tilde{q}$ and $q$ it is defined as:

$$\mathrm{Qloss}(\tilde{q}, q) = \log\left(\epsilon + 1 - |\tilde{q} \cdot q|\right) ,$$

where $\epsilon$ is introduced for stability.
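The two cheaper building blocks of this section can be sketched directly. The Qloss form used here, log(ε + 1 − |q̃ · q|), follows our reading of billings2018silhonet. The value of `eps` is an arbitrary placeholder, since the setting used in the experiments is not stated, and the function names are hypothetical.

```python
import math


def qloss(q_est, q_gt, eps=1e-4):
    """Qloss for two unit quaternions; eps is an assumed stabilizer value."""
    dot = sum(a * b for a, b in zip(q_est, q_gt))
    # The absolute value makes q and -q (the same rotation) equivalent.
    return math.log(eps + 1.0 - abs(dot))


def naive_average(quats):
    """Normalized sum of quaternions (not invariant to sign flips)."""
    total = [sum(q[k] for q in quats) for k in range(4)]
    norm = math.sqrt(sum(c * c for c in total))
    return tuple(c / norm for c in total)


q_a = (1.0, 0.0, 0.0, 0.0)
q_b = (0.0, 1.0, 0.0, 0.0)
neg_q_a = tuple(-c for c in q_a)

same = qloss(q_a, q_a)          # log(eps): the minimum of the loss
flipped = qloss(q_a, neg_q_a)   # identical to `same`
different = qloss(q_a, q_b)     # log(eps + 1): orthogonal quaternions
mean_q = naive_average([q_a, q_b])
```

The naive average is a simple differentiable operation, which is why it is convenient for backpropagation, but unlike Qloss it is sensitive to the sign chosen for each quaternion.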
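The final loss function used during training is, similarly to PoseCNN, a linear combination of segmentation ($L_{seg}$), translation ($L_{trans}$), and orientation loss ($L_{rot}$):

$$L = \alpha_{seg} L_{seg} + \alpha_{trans} L_{trans} + \alpha_{rot} L_{rot} .$$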
We perform our experiments on the challenging YCB-Video Dataset xiang2017posecnn. The dataset contains 133,936 images extracted from 92 videos, showing 21 rigid objects. For each object, the dataset contains a point model with 2620 points and a mesh file. Additionally, the dataset contains 80,000 synthetic images. The synthetic images are not physically realistic. Randomly selected images from SUN2012 xiao2010sun and ObjectNet3D xiang2016objectnet3d are used as backgrounds for the synthetic frames.
When creating the dataset only the first frame of each video was annotated manually and the rest of the frames were inferred using RGB-D SLAM techniques. Therefore, the annotations are sometimes less precise.
The images contain multiple relevant objects in each image, as well as occasionally uninteresting objects and distracting background. Each object appears at most once in each image. The dataset includes symmetric and texture-poor objects, which are especially challenging.
4.2 Evaluation Metrics
We evaluate our method under the AUC P and AUC S metrics as defined for PoseCNN xiang2017posecnn. For each model we report the total area under the curve for all objects in the test set. The AUC P variant is based on a point-wise distance metric which does not consider symmetry effects (also called ADD). In contrast, AUC S is based on an ICP-like distance function (also called ADD-S) which is robust against symmetry effects. For details, we refer to xiang2017posecnn. We additionally report the same metric when the translation is not applied, referred to as “rotation only”.
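For readers unfamiliar with these metrics, the following sketch shows the two underlying distances and one way to compute a normalized area under the accuracy-threshold curve. The maximum threshold of 0.1 m is an assumption based on our understanding of the PoseCNN protocol, and the closed-form area is a simplification, so the exact numbers produced by the YCB-Video Toolbox may differ. All function names are hypothetical, and both point lists are expected to be already transformed by the respective poses.

```python
import math


def add_error(gt_points, est_points):
    """ADD: mean distance between corresponding model points."""
    pairs = zip(gt_points, est_points)
    return sum(math.dist(g, e) for g, e in pairs) / len(gt_points)


def adds_error(gt_points, est_points):
    """ADD-S: mean distance from each ground-truth point to its closest estimate."""
    total = sum(min(math.dist(g, e) for e in est_points) for g in gt_points)
    return total / len(gt_points)


def auc(errors, max_threshold=0.1):
    """Normalized area under the accuracy-vs-threshold curve on [0, max_threshold].

    For a single error e, the area under the step function 1[e < t] equals
    max(0, max_threshold - e); averaging and dividing by max_threshold
    gives a value in [0, 1].
    """
    return sum(max(0.0, 1.0 - e / max_threshold) for e in errors) / len(errors)


# A symmetric two-point object whose estimate is flipped by 180 degrees.
gt = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
est = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(add_error(gt, est))   # 2.0
print(adds_error(gt, est))  # 0.0
```

The example shows why the symmetric variant is needed: the flipped estimate is visually indistinguishable from the ground truth, yet the point-wise distance reports a large error.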
We implemented our experiments using the PyTorch framework paszke2017automatic, with the Hough voting layer implemented on CPU using Numba numba, which proved to be more performant than a GPU implementation. Note that there is no backpropagation through the Hough layer.
For the parts that are equivalent to PoseCNN we followed the published code, which has some differences to the corresponding publication xiang2017posecnn, including the application of dropout and estimation of instead of in the translation branch. We found that these design choices improve the results in our architecture as well.
For training ConvPoseCNN we generally follow the same approach as for PoseCNN: We use SGD with learning rate 0.001 and momentum 0.9. For the overall loss we use . For the L2 and the Qloss we use also , for the SMLoss we used . To bring the depth error to a similar range as the center direction error, we scale the (metric) depth by a factor of 100.
We trained our network with a batch size of 2 for approximately 300,000 iterations, using early stopping. Since the YCB-Video Dataset contains real and synthetic frames, we choose a synthetic image with a probability of 80% and render it onto a random background image from the SUN2012 xiao2010sun and ObjectNet3D xiang2016objectnet3d datasets.
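The sampling rule for training frames can be written down as a short sketch. The function name, the string placeholders standing in for images, and the tuple return format are assumptions for illustration; the actual compositing of a synthetic frame onto its background is omitted.

```python
import random


def sample_training_frame(real_frames, synthetic_frames, backgrounds,
                          p_synthetic=0.8, rng=None):
    """Pick a synthetic frame (plus background) with probability p_synthetic."""
    rng = rng or random
    if rng.random() < p_synthetic:
        # Synthetic frames are later rendered onto a random background image.
        return "synthetic", rng.choice(synthetic_frames), rng.choice(backgrounds)
    return "real", rng.choice(real_frames), None


# Placeholder identifiers instead of actual image data.
real = ["real_0", "real_1"]
synthetic = ["syn_0", "syn_1", "syn_2"]
backgrounds = ["bg_sun2012", "bg_objectnet3d"]

rng = random.Random(0)
kinds = [sample_training_frame(real, synthetic, backgrounds, rng=rng)[0]
         for _ in range(10000)]
synthetic_fraction = kinds.count("synthetic") / len(kinds)
```

Over many draws, `synthetic_fraction` settles close to the configured 0.8.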
4.5 Prediction Averaging
|Method|6D pose xiang2017posecnn, AUC P|6D pose xiang2017posecnn, AUC S|Rotation only, AUC P|Rotation only, AUC S|
Calculated from the PoseCNN model published in the YCB-Video Toolbox.
We first evaluated the different orientation loss functions presented in Section 3.2: L2, Qloss, and SMLoss. For SMLoss, we first averaged the quaternions predicted for each object with a naive average before calculating the loss.
The next pipeline stage after predicting dense orientation is the aggregation into a single orientation. We first investigated the quaternion average following markley2007quaternion, using either segmentation confidence or quaternion norm as sample weights. As can be seen in Table 1, norm weighting showed the best results.
|Method|6D pose xiang2017posecnn, AUC P|6D pose xiang2017posecnn, AUC S|Rotation only, AUC P|Rotation only, AUC S|
Since weighting seemed to be beneficial, which suggests that there are less precise or outlier predictions that should be ignored, we experimented with pruning of the predictions using the following strategy: The quaternions are sorted by confidence and the least confident ones, according to a removal fraction $\lambda$, are discarded. The weighted average of the remaining quaternions is then computed as described above. The results are shown as pruned($\lambda$) in Table 2. We also report the extreme case, where only the most confident quaternion is left. Overall, pruning shows a small improvement, with the ideal value of $\lambda$ depending on the target application. More detailed evaluation shows that especially the symmetric objects show a clear improvement when pruning. We attribute this to the fact that the averaging methods do not handle symmetries, i.e. an average of two shape-equivalent orientations can be non-equivalent. Pruning might help to reduce other shape-equivalent but L2-distant predictions and thus improves the final prediction.
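The pruning step itself is simple enough to sketch. The function below sorts predictions by confidence and discards the given fraction of the least confident ones, always keeping at least one so that a removal fraction of 1 corresponds to the extreme case described above. The name `prune` and the rounding rule (truncation) are assumptions; the surviving quaternions and weights would then be passed to the weighted average.

```python
def prune(quats, confidences, removal_fraction):
    """Discard the least confident predictions; always keep at least one."""
    if not 0.0 <= removal_fraction <= 1.0:
        raise ValueError("removal_fraction must lie in [0, 1]")
    # Indices sorted from most to least confident.
    order = sorted(range(len(quats)), key=lambda i: confidences[i], reverse=True)
    n_keep = max(1, len(quats) - int(removal_fraction * len(quats)))
    kept = order[:n_keep]
    return [quats[i] for i in kept], [confidences[i] for i in kept]


quats = [
    (1.0, 0.0, 0.0, 0.0),
    (0.0, 1.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 0.0),
    (0.0, 0.0, 0.0, 1.0),
]
confidences = [0.2, 0.9, 0.5, 0.7]

kept_q, kept_w = prune(quats, confidences, 0.5)   # keeps the two most confident
only_q, only_w = prune(quats, confidences, 1.0)   # keeps only the most confident
```

Here the confidence can be either of the weight sources discussed earlier, i.e. the segmentation confidence or the quaternion norm before normalization.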
4.6 Prediction Clustering
|Method|6D pose xiang2017posecnn, AUC P|6D pose xiang2017posecnn, AUC S|Rotation only, AUC P|Rotation only, AUC S|
RANSAC uses unit weights, while W-RANSAC is weighted by quaternion norm. PoseCNN and the best performing averaging methods are shown for comparison. Numbers in parentheses describe the clustering threshold in radians.
|Method|6D pose xiang2017posecnn, AUC P|6D pose xiang2017posecnn, AUC S|Rotation only, AUC P|Rotation only, AUC S|NonSymC, AUC P|SymC, AUC S|Translation error [m]|Segmentation IoU|
|PoseCNN (own impl.)|53.29|78.31|69.00|90.49|60.91|57.91|0.0465|0.8071|
|PoseCNN (own impl.)|52.90|80.11|69.60|91.63|76.63|84.15|0.0345|1|
The average translation error, the segmentation IoU and the AUC metrics for different models. The AUC results were achieved using weighted RANSAC(0.1) for ConvPoseCNN Qloss, Markley’s norm weighted average for ConvPoseCNN Shape and weighted RANSAC(0.2) for ConvPoseCNN L2. GT segm. refers to ground truth segmentation (i.e. only pose estimation).
For clustering with the RANSAC strategies, we used the angular distance between rotations as the clustering distance function and performed 50 RANSAC iterations. In contrast to the L2 distance in quaternion space, this distance function does not suffer from the antipodal symmetry of the quaternion orientation representation. The results for ConvPoseCNN L2 are shown in Table 3. For comparison, the best-performing averaging strategies are also listed. The weighted RANSAC variant generally performs slightly better than the non-weighted variant for the same inlier thresholds, which is consistent with our findings in Section 4.5. In comparison, clustering performs slightly worse than the averaging strategies for AUC P, but slightly better for AUC S, as expected due to the symmetry effects.
4.7 Loss Variants
|Method|Total xiang2017posecnn, AUC P|Total xiang2017posecnn, AUC S|Rotation only, AUC P|Rotation only, AUC S|
The aggregation methods showed very similar results for the Qloss trained model, which are omitted here for brevity. For the SMLoss variant, we report the results in Table 5. Norm weighting improves the result, but pruning does not. This suggests that there are less-confident but important predictions with higher distance from the mean, so that their removal significantly affects the average. This could be an effect of training with the average quaternion, where such behavior is not discouraged. The RANSAC clustering methods generally produce worse results than the averaging methods in this case. We conclude that the average-before-loss scheme is not advantageous and a fast dense version of SMLoss would need to be found in order to apply it in our architecture. The pixel-wise losses obtain superior performance.
4.8 Final Results
Figure 3 shows qualitative results of our best-performing model on the YCB-Video dataset. We especially note the spatial structure of our novel dense orientation estimation. Due to the dense nature, its output is strongly correlated to image location, which allows straightforward visualization and analysis of the prediction error w.r.t. the involved object shapes. As expected, regions that are close to boundaries between objects or far away from orientation-defining features tend to have higher prediction error. However, this is nicely compensated by our weighting scheme, as the predicted quaternion norm before normalization correlates with this effect, i.e. is lower in these regions. We hypothesize that this is an implicit effect of the dense loss function: In areas with high certainty (i.e. easy to recognize), the network output is encouraged strongly in one direction. In areas with low certainty (i.e. easy to confuse), the network cannot sufficiently discriminate and gets pulled into several directions, resulting in outputs close to zero.
In Table 4, we report evaluation metrics for our models with the best averaging or clustering method. As a baseline, we include the PoseCNN results, computed from the YCB-Video Toolbox model (https://github.com/yuxng/YCB_Video_toolbox). We also include our re-implementation of PoseCNN. We achieved similar final AUCs on the test set. We also show more detailed results with regard to translation and segmentation of the different models. For this we report the average translation error and the segmentation IoU for all models in Table 4. They show that there is a strong influence of the translation estimation on the AUC values. However, for the models with better translation estimation, the orientation estimation is worse.
For the total as reported by PoseCNN, all three ConvPoseCNNs have a slightly higher AUC than PoseCNN, but only the model trained with Qloss has a similar orientation estimation to PoseCNN. Compared to PoseCNN, some models perform better for the orientation and some better for the translation, even though the translation estimation branch is the same for all of these networks. We were interested in the models' performance with regard to the symmetric and non-symmetric objects. For this we calculated the class-wise average over the AUCs for the symmetric and non-symmetric objects separately. In Table 4 we report them as NonSymC and SymC and report AUC P and AUC S, respectively. ConvPoseCNN performed slightly better than PoseCNN for the non-symmetric objects but worse for the symmetric ones. This is not surprising since Qloss and L2 loss are not designed to handle symmetric objects. The model trained with SMLoss also achieves suboptimal results for the symmetric objects compared to PoseCNN. This might be due to different reasons: First, we utilize an average before calculating the loss; therefore, during training the average might penalize predicting different shape-equivalent quaternions, in case their average is not shape-equivalent. Secondly, there are only five symmetric objects in the dataset and we noticed that two of those, the two clamp objects, are very similar and thus challenging, not only for the orientation but also for the segmentation and vertex prediction. This is further complicated by a difference in object coordinate systems for these two objects.
We also included results in Table 4 that were produced by evaluating using the ground truth semantic segmentation, in order to investigate how much our model’s performance could improve by the segmentation performance alone. If the segmentation is perfect, then the orientation and the translation estimation of all models improve. Even the re-implemented PoseCNN improves its orientation; therefore the RoIs must have improved by the better translation and inlier estimation. Even though our aim was to change the orientation estimation of PoseCNN, our results show that this cannot be easily isolated from the translation estimation, since both have large effects on the resulting performance. In our experiments, further re-balancing of the loss coefficients was not productive due to this coupled nature of the translation and orientation sub-problems.
We conclude that finding a proper balancing between translation and orientation estimation is important but difficult to achieve. Also, a better segmentation would further improve the results.
5 Comparison to Related Work
|Method|AUC P|AUC S|AUC heatmaps|
|heatmaps without FM| | |61.41|
|heatmaps with FM| | |72.79|
Comparison between PoseCNN (as reported in xiang2017posecnn), ConvPoseCNN L2 with pruned(0.75), and heatmaps without and with Feature Mapping (FM).
|AUC P||AUC S||AUC P||AUC S|
In Table 6, we compare ConvPoseCNN L2 to the values reported in the PoseCNN paper, as well as with a different class-wise averaged total as in heatmaps. We also compare to the method of heatmaps, with and without their proposed Feature Mapping technique, as it should be orthogonal to our proposed method. One can see that our method slightly outperforms PoseCNN, but we make no claim of significance, since we observed large variations depending on various hyperparameters and implementation details. We also slightly outperform heatmaps without Feature Mapping. Table 7 shows class-wise results.
We also investigated applying the Feature Mapping technique heatmaps to our model. Following the process, we render synthetic images with poses corresponding to the real training data. We selected the extracted VGG-16 features for the mapping process and thus have to transfer two feature maps with 512 features each. Instead of using a fully-connected architecture as the mapping network, as done in heatmaps, we followed a convolutional set-up and mapped the feature from the different stages to each other with residual blocks based on convolutions.
The results are reported in Table 6. However, we did not observe the large gains reported by heatmaps for our architecture. We hypothesize that the feature mapping technique is highly dependent on the quality and distribution of the rendered synthetic images, which may not be of sufficient quality in our case.
6 Time Comparisons
|ConvPoseCNN L2||2.09||308.9 MiB|
|ConvPoseCNN Qloss||2.09||308.9 MiB|
|ConvPoseCNN Shapeloss||1.99||308.9 MiB|
Using a batch size of 2. Averaged over 400 iterations.
We timed our models on an NVIDIA GTX 1080 Ti GPU with 11 GB of memory. Table 8 lists the training times for the different models, as well as the model sizes when saved. The training of the ConvPoseCNNs is almost twice as fast and the models are much smaller compared to PoseCNN.
|Method|Time [ms]¹|Aggregation [ms]|
|- naive average|136.96|2.34|
|- weighted average|146.92|13.00|
|- pruned w. average|148.61|14.64|
|- w. RANSAC|563.16|65.82|
¹ Single frame, includes aggregation.
The speed of the ConvPoseCNN models at test time depends on the method used for quaternion aggregation. The times for inference are shown in Table 9. For the averaging methods the times do not differ much from PoseCNN. PoseCNN takes longer to produce the output, but then does not need to perform any other step. For ConvPoseCNN the naive averaging method is the fastest, followed by the other averaging methods. RANSAC is, as expected, slower. The forward pass of ConvPoseCNN takes about 65.5 ms, the Hough transform around 68.6 ms. We note that the same Hough transform implementation is used for PoseCNN and ConvPoseCNN in this comparison.
In summary, we gain advantages in terms of training time and model size, while inference times are similar. While the latter finding initially surprised us, we attribute it to the high degree of optimization that RoI pooling methods in modern deep learning frameworks have received.
As shown in this work, it is possible to directly regress 6D pose parameters in a fully-convolutional way, avoiding the sequential cutting out and normalizing of individual object hypotheses. Doing so yields a much smaller, conceptually simpler architecture with fewer parameters that estimates the poses of multiple objects in parallel. We thus confirm the corresponding trend in the related object detection task—away from RoI-pooled architectures towards fully-convolutional ones—also for the pose estimation task.
We demonstrated benefits of the architecture in terms of the number of parameters and training time without reducing prediction accuracy on the YCB-Video dataset. Furthermore, the dense nature of the orientation prediction allowed us to visualize both prediction quality and the implicitly learned weighting and thus to confirm that the method attends to feature-rich and non-occluded regions.
An open research problem is the proper aggregation of dense predictions. While we presented methods based on averaging and clustering, superior (learnable) methods surely exist. In this context, the proper handling of symmetries becomes even more important. In our opinion, semi-supervised methods that learn object symmetries and thus do not require explicit symmetry annotation need to be developed, which is an exciting direction for further research.
Acknowledgment: This work was funded by grant BE 2556/16-1 (Research Unit FOR 2535 Anticipating Human Behavior) of the German Research Foundation (DFG).