Pose Proposal Critic: Robust Pose Refinement by Learning Reprojection Errors

05/13/2020 ∙ by Lucas Brynte, et al. ∙ Chalmers University of Technology

In recent years, considerable progress has been made for the task of rigid object pose estimation from a single RGB image, but achieving robustness to partial occlusions remains a challenging problem. Pose refinement via rendering has shown promise for achieving improved results, in particular when data is scarce. In this paper we focus our attention on pose refinement, and show how to push the state-of-the-art further in the case of partial occlusions. The proposed pose refinement method leverages a simplified learning task, where a CNN is trained to estimate the reprojection error between an observed and a rendered image. We experiment by training on purely synthetic data as well as a mixture of synthetic and real data. Current state-of-the-art results are outperformed for two out of three metrics on the Occlusion LINEMOD benchmark, while performing on par for the final metric.


1 Introduction

Accurately estimating the 3D location and orientation of an object from a single image, a.k.a. rigid object pose estimation, has many important real-world applications, such as robotic manipulation, augmented reality and autonomous driving. Although the problem has commonly been addressed by exploiting RGB-D cameras, e.g., [1], this introduces an increased cost of hardware, sensitivity to sunlight, and unreliable or missing depth measurements for reflective or transparent objects. In recent years, more attention has been paid to RGB-only pose estimation, and although considerable progress has been made, a major challenge remains in achieving robustness to partial occlusions. To this end, rendering-based pose refinement methods have shown promise for achieving improved results, but their full potential remains unexplored.

In this paper we revisit pose refinement via rendering, and focus specifically on how to further improve the robustness of such methods, in particular with respect to partial occlusions. Our method can hence be used to refine the estimates of any pose estimation algorithm, and we give several experimental demonstrations that this is indeed achieved for different algorithms. Naturally, we also compare to other refinement methods.

Shared among contemporary rendering-based pose refinement methods [2, 3, 4] is the approach of feeding an observed as well as a synthetically rendered image as input to a CNN model, which is trained to predict the relative pose of an object between the two images. Our key insight is that rendering-based pose refinement is possible without explicitly regressing to the parameter vector of the relative pose. Instead, estimating an error function of the relative pose is enough, since minimization of said error function w.r.t. pose can be done during inference. Figure 1 shows error function estimates for two test frames. Although a larger bias is observed in the occluded case, the estimated minimum is still close to the ground truth.

Figure 1: Estimated average reprojection error of our network for the cat object in two test frames of Occlusion LINEMOD [1]. A rotational perturbation around a fixed axis in the camera coordinate frame is applied, over a range of angles. The estimated minimum is marked in the figure. (a-b) An unoccluded example. (c-d) An example with occlusion.

Our main contributions can be summarized as follows:

  • A novel pose refinement method, which works well without real training data.

  • Robustness to partial occlusions, owing to the implicit nature of our method, which makes it insensitive to over- or under-estimation of the error function.

  • State-of-the-art results for two out of three metrics on Occlusion LINEMOD [1].

Our pose refinement pipeline takes as input (a) an object CAD model and (b) an initial pose estimate, referred to as a “pose proposal”. The pose proposal is assumed to be obtained from another method, and is fed to the refinement pipeline, consisting of three parts:

  1. Synthetic rendering of the detected object under the pose proposal.

  2. Estimation of the average reprojection error of all model points, when projected into the image using the ground truth pose as well as the pose proposal.

  3. Iterative refinement of the 6D pose estimate by minimizing the reprojection error.

We will refer to our method as Pose Proposal Critic (PPC), since the heart of the method involves judging the quality of a pose proposal.
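To make the structure concrete, the following minimal Python sketch shows how the three parts fit together. The helper names (render_object, crop_and_zoom, error_cnn) as well as the step size, finite-difference step and iteration count are illustrative placeholders, not the actual implementation.

```python
import numpy as np


def refine_pose(observed_image, cad_model, K, pose_proposal,
                render_object, crop_and_zoom, error_cnn,
                num_iters=100, step_size=1e-2, eps=1e-3):
    """Illustrative PPC-style refinement loop built on hypothetical helpers:
    render_object(cad_model, K, pose) -> synthetic image (Part I),
    crop_and_zoom(image, K, pose)     -> square patch around the projected
                                         object center,
    error_cnn(obs_patch, rend_patch)  -> scalar estimate of the average
                                         reprojection error (Part II).
    """
    def predicted_error(pose):
        rendered = render_object(cad_model, K, pose)
        obs_patch = crop_and_zoom(observed_image, K, pose)
        rend_patch = crop_and_zoom(rendered, K, pose)
        return error_cnn(obs_patch, rend_patch)

    pose = np.asarray(pose_proposal, dtype=float).copy()  # 6D pose parameters
    for _ in range(num_iters):
        # Part III: minimize the predicted error by gradient descent,
        # using finite differences since the renderer is not differentiated.
        base = predicted_error(pose)
        grad = np.zeros_like(pose)
        for i in range(pose.size):
            perturbed = pose.copy()
            perturbed[i] += eps
            grad[i] = (predicted_error(perturbed) - base) / eps
        pose -= step_size * grad
    return pose
```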

2 Related Work

Similarly to the work of Kendall et al. for camera localization [5], the rigid object pose estimation methods of Xiang et al. [6] and Do et al. [7] estimate the pose by directly regressing the pose parameters. The most successful methods for rigid object pose estimation do however make use of a two-stage pipeline, where 2D-3D correspondences are first established, and the object pose is then retrieved by solving the corresponding camera resectioning problem [8]. A common approach has been to regress 2D locations of a discrete set of object keypoints, projected into the image, yielding a sparse set of correspondences [9, 10, 11]. Other methods instead output heatmaps in order to encode said keypoint locations [12, 13].

Among the sparse correspondence methods, Oberweger et al. [13] stand out in that they address the problem of partial occlusions very carefully. They show that occluders typically have a corrupting effect on CNN activations, far beyond the occluded region itself, and that training with occluded samples might not help to overcome this problem. Instead they resort to limiting the receptive field of their keypoint detector, as a crude but effective way to limit the impact of occluders.

Dense correspondence methods, on the other hand, are inherently more robust to large variances in correspondence estimates. Pixel-wise regression of the corresponding (object frame) 3D coordinates has been proposed by [14, 15, 16], while Zakharov et al. [4] take a different approach: they discretize the object surface into smaller segments, and then classify which segment is visible in each pixel. Hu et al. [17] use a sparse set of keypoints, yet leverage dense correspondences by redundantly regressing a number of 2D locations for each object keypoint. Peng et al. [18] take a similar approach, but simplify the output space by regressing, for each pixel, only the direction towards each projected keypoint, not the corresponding distance. Pair-wise sampling of the pixel-wise predictions then yields votes for keypoint locations.

The method of Peng et al. [18] does indeed prove robust to partial occlusions, and yields accurate estimates of rotation as well as lateral translation on the Occlusion LINEMOD benchmark. Consequently, their results are state-of-the-art for the depth-insensitive metric based on reprojection errors. Nevertheless, their depth estimates are not accurate and suffer in the presence of partial occlusion. The rendering-based pose refinement method DeepIM (Li et al. [3]), however, performs well on Occlusion LINEMOD for all common metrics, and in particular gives a large boost in depth estimation accuracy, yielding state-of-the-art results for the metric based on matching point clouds in 3D. This suggests that rendering-based pose refinement is a powerful tool for accurate pose estimation in the presence of partial occlusion. We will experimentally compare to DeepIM and show how one can achieve significantly improved results for partial occlusions.

Moreover, we point out that while a multitude of approaches for increasing robustness, especially to partial occlusions, has been observed among correspondence-based pose estimation methods, we have not yet seen any directed efforts to address these issues in the literature of rendering-based pose refinement.

When it comes to rendering-based pose refinement, early work was done by Tjaden et al. [19], proposing a segmentation pipeline based on hand-crafted features and iterative alignment of silhouettes. Rad and Lepetit [9] also apply rendering-based refinement as part of the BB8 pose estimation pipeline, improving on initial estimates. BB8 is based on sparse correspondences (the 8 bounding box corners), and a refinement CNN is trained to regress the reprojection errors for each of the bounding box corners. Refinement is then carried out on the correspondences themselves, yielding an updated camera resectioning problem to be solved. In contrast, our method instead estimates the average reprojection error over all model points, and refinement is done directly on the pose.

Manhardt et al. [2], Li et al. [3] and Zakharov et al. [4] all propose a CNN-based refinement pipeline, where the model is trained to learn the relative pose between an observed image and a synthetically rendered image under a pose proposal. The main difference between their approaches and ours is that we instead choose to learn an error function of the relative pose. Among these methods, [3] is the only one that handles partial occlusions well. The results of [4] seem competitive at first glance, but evaluation is only carried out on the frames for which the 2D object detector successfully detected the object of interest, and furthermore parts of the Occlusion LINEMOD dataset were used for training, which does not allow for a fair comparison.

3 Method

In this section, the three main parts of our pipeline will be described in detail.

The core idea of our approach is that even though neural networks have an amazing capacity to learn difficult estimation tasks, the learning problem should be kept as simple as possible. Given a pose proposal, the task of our network is to determine how good the proposal is with respect to the ground truth. So, instead of trying to learn the pose parameters directly, it is only required for the network to act as a critic of different proposals. To further simplify the task, we render a synthetic image using the pose proposal, and then the network only needs to determine whether the rendered image is similar to the observed image or not. As a measure of similarity, we use the average reprojection error of object CAD model points. Then, at inference, the objective is to find the pose parameters with the lowest predicted reprojection error, resulting in a minimization problem which can be solved with standard optimization techniques.

It is assumed that intrinsic camera parameters are known for the observed images, and that a three-dimensional CAD model of the object of interest is available.

3.1 Part I: Rendering the Object Under a Pose Proposal

Similar to previous work [9, 3, 2, 4], we render a synthetic image of a detected object based on the suggested pose proposal. Rendering is done on the GPU using OpenGL with Lambertian shading and the light source at the camera center. The background is kept black.

Rather than using the full observed image directly, we zoom in on the detected object. Zooming is done based on the current pose proposal, yielding square image patches centered at the projection of the object center. The side length of the corresponding image patches (observed and rendered) is chosen proportional to the projected object diameter, and the observed patch is bilinearly upsampled to a fixed resolution. Note that the object will be centered in the rendered image patch, but need not be centered in the observed patch unless the pose proposal is accurate.

For future reference, we let the zoom-in operator, defined by the pose proposal, act on the observed image and produce the observed image patch; the rendered image patch is obtained analogously. For performance reasons, the rendered patch is rendered at a lower resolution and then bilinearly upsampled to the patch resolution.
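A minimal sketch of such a zoom-in operation is given below, assuming a standard pinhole camera with intrinsics K; the zoom factor and output resolution are placeholder values, since the exact numbers are not reproduced in this text.

```python
import numpy as np
import cv2  # used for padding and bilinear resampling


def zoom_in(image, K, t, object_diameter, zoom_factor=1.5, out_size=224):
    """Crop a square patch centered on the projected object center.

    The patch side is proportional to the projected object diameter, and the
    crop is bilinearly resampled to a fixed resolution. zoom_factor and
    out_size are illustrative placeholders. t is the object center in camera
    coordinates (the translation of the pose proposal).
    """
    # Project the object center into the image.
    u = K[0, 0] * t[0] / t[2] + K[0, 2]
    v = K[1, 1] * t[1] / t[2] + K[1, 2]

    # Projected diameter in pixels: focal length times diameter over depth.
    proj_diameter = K[0, 0] * object_diameter / t[2]
    half_side = 0.5 * zoom_factor * proj_diameter

    x0, x1 = int(round(u - half_side)), int(round(u + half_side))
    y0, y1 = int(round(v - half_side)), int(round(v + half_side))

    # Pad so the crop never leaves the image, then crop and resize.
    pad_x = max(0, -x0, x1 - image.shape[1])
    pad_y = max(0, -y0, y1 - image.shape[0])
    padded = cv2.copyMakeBorder(image, pad_y, pad_y, pad_x, pad_x,
                                cv2.BORDER_CONSTANT, value=0)
    patch = padded[y0 + pad_y:y1 + pad_y, x0 + pad_x:x1 + pad_x]
    return cv2.resize(patch, (out_size, out_size),
                      interpolation=cv2.INTER_LINEAR)
```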

3.2 Part II: Learning Average Reprojection Error

We use a pretrained optical flow network as backbone, add a regression head, and finetune the model to take the observed and rendered image patches as input and output an estimate of the average reprojection error, i.e., the average image distance between the CAD model points projected using the ground truth pose and the pose proposal, respectively.

Let $\hat{e}(\mathcal{I}, \bar{\mathcal{I}})$ denote the error estimated by the neural network, where $\mathcal{I}$ is the observed image patch and $\bar{\mathcal{I}}$ is the rendered image patch for pose proposal $\hat{P}$. If $\pi_P(X)$ denotes the projection of a 3D point $X$ onto the image patch using pose $P$ (note that the projection operator itself depends on the pose, due to the dependence of the zoomed-in image patch, and thus the effective intrinsic camera parameters, on the estimated object position), the reprojection error to be estimated by the network is given by

$$e(\hat{P}, P) = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \pi_{\hat{P}}(X_i) - \pi_{P}(X_i) \right\rVert, \qquad (1)$$

where $\hat{P}$ and $P$ are the estimated and true poses, respectively, and $X_1, \dots, X_N$ are the object model points. Figure 1 shows the estimated error function for two test frames of Occlusion LINEMOD, one in which the object is partially occluded.

The reprojection error is measured in image patch pixels, i.e., after zoom-in rather than before. Estimating the reprojection error before zoom-in would require the network to estimate and rescale by the absolute depth, which would introduce an unnecessary complication. We choose the reprojection error because we expect it to be relatively easy to infer from image pairs without much high-level reasoning, thus providing a relatively easy learning task. Furthermore, the reprojection error is closely related to optical flow, and should therefore fit particularly well with a pretrained optical flow backbone.
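As a concrete illustration of Eq. (1), the regression target can be computed as in the sketch below; K_patch stands for the effective intrinsics of the zoomed-in patch and is an assumed input here.

```python
import numpy as np


def project_points(X, K, R, t):
    """Project Nx3 object-frame points with pose (R, t) and intrinsics K."""
    X_cam = X @ R.T + t                       # to camera coordinates
    x = X_cam[:, :2] / X_cam[:, 2:3]          # perspective division
    return x @ K[:2, :2].T + K[:2, 2]         # apply focal lengths and center


def avg_reprojection_error(X, K_patch, R_gt, t_gt, R_est, t_est):
    """Average reprojection error (Eq. 1), measured in image-patch pixels.

    K_patch denotes the effective intrinsics of the zoomed-in patch, which
    depend on the pose proposal through the crop (see Section 3.1).
    """
    p_gt = project_points(X, K_patch, R_gt, t_gt)
    p_est = project_points(X, K_patch, R_est, t_est)
    return np.mean(np.linalg.norm(p_gt - p_est, axis=1))
```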

For further details on the implementation, we refer to the appendix. Section A.1 gives more details on the network architecture, and Section A.2 on the loss function and hyperparameters used for training. How to sample the pose proposals during training is covered in Section A.3, while Section A.4 describes data augmentation strategies, and in particular how to generate synthetic training examples.

3.3 Part III: Minimizing Reprojection Error

Once the CNN is trained to estimate reprojection errors, we may apply it to refine an initial pose proposal for the object of interest in the observed image.

Let the compound function $f(\hat{P})$ encapsulate the operations of rendering, zoom-in and the CNN itself, which leads to the optimization problem $\min_{\hat{P}} f(\hat{P})$. We minimize $f$ locally, initializing at the initial pose proposal. Gradient-based optimization is carried out, and although analytical differentiation is a tempting approach in the light of differentiable renderers such as [20], we observed noisy behavior in $f$, and instead apply numerical differentiation to robustly estimate the gradient of $f$.

For parameterizing the rotation, we take advantage of the Lie algebra of $SO(3)$. The initial rotation is used as a reference point, and the rotation is parameterized by composing it with the matrix exponential of a skew-symmetric matrix, whose three elements constitute the rotation parameters. The translation is split into two parts. The lateral translation represents the deviation, in pixels, from the projection of the object center given by the initial pose proposal. The depth is parameterized relative to the initial depth estimate.
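A sketch of this parameterization is given below. The composition order of the rotation update and the exact form of the depth update are not spelled out in the text above, so the code fixes one consistent choice (rotation update applied on the right of the initial rotation, and a multiplicative, log-scale depth update); these should be read as assumptions for illustration rather than the paper's exact formulation.

```python
import numpy as np


def skew(w):
    """Skew-symmetric matrix [w]_x of a 3-vector w."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])


def so3_exp(w):
    """Rodrigues formula: matrix exponential of the skew-symmetric [w]_x."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    K = skew(np.asarray(w) / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def compose_pose(params, R0, t0, K):
    """Build a pose from the optimization parameters.

    params = (w, dxy, dz): rotation parameters w (Lie algebra around the
    initial rotation R0), lateral translation dxy (pixel offset of the
    projected object center), and a relative depth update dz.
    """
    w, dxy, dz = params
    R = R0 @ so3_exp(np.asarray(w))            # rotation update around R0
    z = t0[2] * np.exp(dz)                     # multiplicative depth update
    # Shift the projected center by dxy pixels, then back-project at depth z.
    u0 = K[0, 0] * t0[0] / t0[2] + K[0, 2]
    v0 = K[1, 1] * t0[1] / t0[2] + K[1, 2]
    u, v = u0 + dxy[0], v0 + dxy[1]
    t = np.array([(u - K[0, 2]) * z / K[0, 0],
                  (v - K[1, 2]) * z / K[1, 1],
                  z])
    return R, t
```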

3.3.1 Optimization Scheme

It needs to be stressed that there may be spurious local minima in $f$, and for this reason care should be taken during the optimization. For better control over the procedure, we let decoupled optimizers run in parallel for the rotation, depth, and lateral translation parameters, with different hyperparameters and step size decay schedules. A fixed total number of iterations is applied.

In order to handle the non-convex and noisy behavior of $f$, we use stochastic optimization. In particular, the Adam optimizer [21] proved effective for handling the fact that $f$ may be quite steep in the vicinity of the optimum, yet quite flat farther away from it. Otherwise, the optimizer would risk taking too large steps when encountering steep regions, or easily getting stuck in local minima, if the step size was set too high or too low, respectively.

Consider for now the optimization w.r.t. rotation and depth. The optimization is roughly carried out in two phases, first w.r.t. rotation and then depth, with a smooth transition between the two. This sequential strategy is motivated by two observations: (1) Although optimization w.r.t. rotation works well despite a sub-optimal depth estimate, keeping the depth fixed reduces noise in the moment estimates of the optimizer. (2) For precise depth estimation, a good estimate of the other parameters is crucial. The reprojection error to be estimated, as well as the rendered image itself, is much less sensitive to depth perturbations than to the other pose parameters, and focusing specifically on the depth in the final stage helped to improve depth estimation.

Optimization w.r.t. the lateral translation proved relatively easy, and less coupled with the other parameters, i.e., a reasonable minimum may be found despite, e.g., a poor rotation estimate. Particularly fast convergence of the lateral translation is desirable for the converse reason: optimization w.r.t. the other parameters is coupled with the lateral translation estimate, and may not work well unless this estimate is adequate. Conveniently, convergence of the lateral translation is achieved in just a couple of iterations when using a plain SGD optimizer with momentum rather than Adam. The step size w.r.t. the lateral translation is kept constant.

The step size decay schedules for all parameters are illustrated in Figure 2, and the exponential decay rates for the moment estimation of Adam were kept fixed.

Figure 2: Step size decay schedules for the rotation, depth, and lateral translation parameters during inference. The decay value is relative to the respective initial step sizes.

Separate finite-difference step sizes were used for the rotation, lateral translation, and depth parameters, respectively.
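Putting the pieces together, a simplified version of the decoupled optimization could look as follows. All step sizes, decay rates and the phase switch are placeholders, f is the compound error predictor from above (here assumed to take the three parameter groups separately), and the simple "depth only in the second half" rule stands in for the smooth two-phase schedule of Figure 2.

```python
import numpy as np


class Adam:
    """Plain Adam for a small parameter vector (placeholder hyperparameters)."""
    def __init__(self, lr, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.m = self.v = None
        self.t = 0

    def step(self, params, grad):
        if self.m is None:
            self.m, self.v = np.zeros_like(params), np.zeros_like(params)
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)


def numeric_grad(f, x, h):
    """Central differences with one step size per coordinate."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h[i]
        g[i] = (f(x + e) - f(x - e)) / (2 * h[i])
    return g


def refine(f, w0, t_lat0, z0, num_iters=100):
    """Decoupled optimization of f(w, t_lat, z): Adam for rotation and depth,
    SGD with momentum for the lateral translation (illustrative settings)."""
    w, t_lat, z = w0.copy(), t_lat0.copy(), np.array([z0], dtype=float)
    adam_w, adam_z = Adam(lr=1e-2), Adam(lr=1e-2)
    vel, momentum, lr_lat = np.zeros_like(t_lat), 0.9, 1.0
    for it in range(num_iters):
        grad_w = numeric_grad(lambda a: f(a, t_lat, z), w, h=np.full(3, 1e-2))
        grad_t = numeric_grad(lambda a: f(w, a, z), t_lat, h=np.full(2, 1.0))
        grad_z = numeric_grad(lambda a: f(w, t_lat, a), z, h=np.full(1, 1e-2))
        # Lateral translation: plain SGD with momentum, constant step size.
        vel = momentum * vel - lr_lat * grad_t
        t_lat = t_lat + vel
        # Rotation throughout, depth mostly towards the end; decaying or
        # increasing step sizes would implement the schedule of Figure 2.
        w = adam_w.step(w, grad_w)
        if it > num_iters // 2:
            z = adam_z.step(z, grad_z)
    return w, t_lat, z
```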

4 Experiments

4.1 Datasets and Training Data

Experiments are carried out on LINEMOD as well as Occlusion LINEMOD.

LINEMOD is a standard benchmark for rigid object pose estimation and was introduced by Hinterstoisser et al. [22]. The dataset consists of object CAD models along with RGB-D image sequences of an indoor scene, where objects are laid out on a table against a cluttered background. For each sequence there is a corresponding object of interest placed at the center. Although depth images are provided, the dataset is also a common benchmark for RGB-only pose estimation. As two of the objects suffer from low-quality CAD models, they are commonly excluded from evaluation, and we follow the same practice.

The Occlusion LINEMOD dataset was produced by Brachmann et al. [1] by taking one of the LINEMOD sequences and annotating the pose of the surrounding objects. While the central object is typically unoccluded, the surrounding objects are often partially occluded, resulting in a challenging dataset. The central object is not part of the benchmark.

For experiments on Occlusion LINEMOD we use real images from LINEMOD for training, as has conventionally been done in the literature, while the Occlusion LINEMOD images are only used for testing. Real and synthetic training images are sampled with fixed probabilities. In the synthetic case, occluding objects are rendered with a certain probability, and otherwise no occluders are rendered. We carry out additional experiments on Occlusion LINEMOD where the model is trained only on synthetic data, still with a portion of the samples being occluded.

For experiments on LINEMOD, we split training and test data exactly as [3], using the same training and test samples. Real and synthetic training images are again sampled with fixed probabilities, but without any occluders in the synthetic images.

The objects known as eggbox and glue, present in both datasets, are conventionally considered symmetric w.r.t. a 180-degree rotation, although it can be argued whether these are actual symmetries. Nevertheless, for these objects we duplicate the initial pose proposals with their 180-degree rotated equivalents and refine the pose from both initializations. In the end, the iterate with the lowest estimated error is chosen.
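For these two objects, the duplication and selection step can be sketched as follows; refine_fn and error_fn stand for the refinement procedure and the network's error estimate, and the object-frame symmetry axis is an assumed placeholder.

```python
import numpy as np


def rot_180(axis):
    """Rotation by 180 degrees around a unit axis (Rodrigues at theta = pi)."""
    a = np.asarray(axis, dtype=float)
    a = a / np.linalg.norm(a)
    return 2.0 * np.outer(a, a) - np.eye(3)


def refine_symmetric(refine_fn, error_fn, R0, t0,
                     sym_axis=np.array([0.0, 0.0, 1.0])):
    """Refine from the proposal and from its 180-degree rotated duplicate,
    keeping the result with the lowest estimated error."""
    candidates = [(R0, t0), (R0 @ rot_180(sym_axis), t0)]
    refined = [refine_fn(R, t) for R, t in candidates]
    errors = [error_fn(R, t) for R, t in refined]
    return refined[int(np.argmin(errors))]
```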

4.2 Evaluation Metrics for Pose Refinement

For evaluation of our pose refinement method, we use three conventional metrics, explained in the following. All of them are defined as the percentage of annotated object instances for which the pose is correctly estimated, i.e., the recall according to the specific ways of quantifying the error.

The average distance metric add-d [22] is the percentage of object instances for which the object point cloud, transformed with the estimated pose and with the ground-truth pose, respectively, has an average distance between corresponding points of less than 10% of the object diameter. The add-s-d metric [22] is closely related, and only differs in that the closest-point distance is used rather than the distance between corresponding points. In general add-d is used, but add-s-d is used for objects that are considered symmetric, and we let add(-s)-d refer to the two of them together. The reproj-px metric is similar to add-d, but differs in that the transformed point clouds are projected into the image before the mean distance is computed; the acceptance threshold is set to 5 pixels. Finally, the 5cm/ metric accepts a pose estimate if the rotational and translational components differ from their ground-truth equivalents by at most 5 degrees and 5 cm, respectively.
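For reference, the two point-cloud-based criteria can be computed as in the sketch below; a brute-force nearest-neighbour search is used for the symmetric variant, for clarity rather than speed, and the 10% threshold follows the standard convention.

```python
import numpy as np


def transform(X, R, t):
    """Apply pose (R, t) to Nx3 model points."""
    return X @ R.T + t


def add_metric(X, R_gt, t_gt, R_est, t_est):
    """Average distance between corresponding transformed model points."""
    d = np.linalg.norm(transform(X, R_gt, t_gt) - transform(X, R_est, t_est),
                       axis=1)
    return d.mean()


def add_s_metric(X, R_gt, t_gt, R_est, t_est):
    """Average distance to the closest transformed point (symmetric objects).
    Brute-force O(N^2) nearest neighbours for clarity."""
    P_gt = transform(X, R_gt, t_gt)
    P_est = transform(X, R_est, t_est)
    dists = np.linalg.norm(P_gt[:, None, :] - P_est[None, :, :], axis=2)
    return dists.min(axis=1).mean()


def is_correct_add(X, diameter, R_gt, t_gt, R_est, t_est,
                   symmetric=False, rel_threshold=0.1):
    """Accept a pose if the add (or add-s) error is below a fraction of the
    object diameter; the conventional threshold is 10%."""
    err_fn = add_s_metric if symmetric else add_metric
    return err_fn(X, R_gt, t_gt, R_est, t_est) < rel_threshold * diameter
```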

4.2.1 "Symmetric" Objects and Faulty Annotations

When it comes to the reproj-px and 5cm/ metrics, they typically do not take symmetries into account. This is not a huge problem for the LINEMOD dataset, partly because the high correlation between training and test data may help resolve any potential symmetries, and partly because, as pointed out earlier, none of the objects are truly symmetric.

For Occlusion LINEMOD, however, the eggbox object is unfortunately annotated according to the supposedly equivalent 180-degree rotated pose in all but the initial frames. For this reason, the above-mentioned metrics make little sense. Li et al. [3] do however modify these metrics to evaluate against the most beneficial of all proposed symmetries, which makes much more sense given the circumstances. We follow their proposal and perform the evaluation on Occlusion LINEMOD w.r.t. these symmetry-aware metrics, which we will refer to as reproj-s-px and 5cm/-s.

4.3 Pose Refinement Results

Here we present our main pose refinement results. For a comparison of different backbone networks and detailed per-object results, we refer the reader to Sections B.1 and B.3 in the appendix. Illustrations of refinement iterates are available in Section B.2.

State-of-the-art comparisons on the Occlusion LINEMOD dataset are given in Table 1. DeepIM [3] used initializations from PoseCNN [6], but as these predictions are not publicly available, we instead rely on initial pose proposals from PVNet [18]. The evaluation of PVNet was carried out by us and is based on the clean-pvnet implementation along with pre-trained models (at times we observed negative depth estimates from PVNet, which were corrected for according to Section C.1 in the appendix). Note that although the symmetry-aware reproj-s-px metric should be used on Occlusion LINEMOD (see Section 4.2.1), Oberweger et al. [13] report their results based on the reproj-px metric. We also want to mention that CDPN [16] performs well on Occlusion LINEMOD, but no quantitative numbers are reported.

Oberweger et al. [13]
PVNet [18]
PoseCNN [6]
+ DeepIM [3]
PVNet [18]
+ PPC (Ours)
add(-s)-d
reproj-s-px
5cm/-s
Table 1: Results on Occlusion LINEMOD. Note that [13] report results according to the reproj-px metric instead of reproj-s-px.

We also present results on the Occlusion LINEMOD dataset where we train purely on synthetic data, see Table 2. Our initial pose proposals are obtained from CDPN [16], which was the previous state-of-the-art for this set-up (cf. Benchmark for 6D Object Pose Estimation (BOP) evaluation server [23]). Also note that for these experiments only a subset of test frames is used, in compliance with BOP.

CDPN-synth [16]
CDPN-synth [16]
+ PPC-synth (Ours)
add(-s)-d
reproj-s-px
5cm/-s
Table 2: Results on Occlusion LINEMOD using only synthetic training data.

Finally, results on the LINEMOD dataset are presented in Table 3 with the purpose of providing a direct comparison with DeepIM [3] with identical initializations. We outperform DeepIM on all metrics using the same proposals from PoseCNN [6]. Note that although no results on LINEMOD for PoseCNN are reported in [6], predictions by PoseCNN are made available by [3]. The results are also good when compared to the state-of-the-art pose estimation methods of Li et al. [16] and Peng et al. [18] on LINEMOD.

PoseCNN [6]
PoseCNN [6]
+ DeepIM [3]
PoseCNN [6]
+ PPC (Ours)
add(-s)-d
reproj-px
5cm/
Table 3: Comparison between the refinement method of DeepIM and ours on LINEMOD with PoseCNN as initialization. Note that [3] reports results according to the reproj-s-px and 5cm/-s metrics instead of reproj-px and 5cm/.

4.4 Running Time

Experiments were run on a workstation with 64 GB RAM, an Intel Core i7-8700K CPU, and an Nvidia GTX 1080 Ti GPU. Our pose refinement pipeline takes on average 33 seconds per frame during inference for the full set of iterations to be carried out, i.e., roughly 3 iterations per second. One way to improve on this could be to enable analytical differentiation through differentiable rendering, although care should be taken to make sure that the estimated error function behaves smoothly enough, for instance via a regularization scheme. Furthermore, rather than using iterative gradient-based optimization, gradient-free and sample-efficient approaches such as Bayesian optimization could be worth exploring, but this is left as future research.

5 Conclusion

We have presented a novel rendering-based pose refinement method, which shows improved performance compared to previous refinement methods, and is robust to partial occlusions.

On the Occlusion LINEMOD benchmark, we initialize our method with pose proposals from PVNet [18], yielding state-of-the-art results for two out of three metrics on this competitive benchmark, while performing on-par with previous methods for the third metric. Furthermore, additional experiments on Occlusion LINEMOD show that our method works well also when trained purely on synthetic data, improving on the pose estimates of CDPN [16]. Finally, on the LINEMOD benchmark, previous refinement methods are outperformed for all metrics.

Appendix A Implementation Details

A.1 Network Architecture

Similar to [3], we use a pretrained FlowNetSimple optical flow network as backbone. They used the original model from Dosovitskiy et al. [24], while we use the FlowNet 2.0 version from Ilg et al. [25]. We flatten the encoder output feature maps and feed them through three fully-connected layers, constituting our main branch. In contrast to [3], we use standard ReLU rather than leaky-ReLU activation functions. Furthermore, each hidden layer has a fixed number of neurons, and dropout is applied. The final layer outputs one neuron, representing the average reprojection error estimate.

We also follow [3] in adding an auxiliary branch for foreground / background segmentation, by adding a 1-channel convolutional layer next to the optical flow prediction at level 4 (a.k.a. flow4). The optical flow prediction itself is disregarded in order to simplify the pipeline, and we point out that the ablation study of [3] showed only a minor boost from including this auxiliary task.

Finally, the FlowNet 2.0 network is fed only the observed and rendered image patches as inputs, and no segmentation input is provided, in contrast to [3]. Experiments on re-training their network without the segmentation input did however not result in any performance drop.
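A PyTorch-style sketch of this regression head is given below; the encoder feature dimension, hidden width, and dropout probability are placeholders, since the exact values are not reproduced in this text.

```python
import torch
import torch.nn as nn


class ReprojectionErrorHead(nn.Module):
    """Regression head on top of a finetuned flow encoder.

    The flattened encoder output is fed through three fully-connected layers
    with ReLU activations and dropout; the final layer outputs a single
    neuron, the estimated average reprojection error. Feature and hidden
    dimensions below are illustrative placeholders.
    """

    def __init__(self, encoder_feat_dim=1024 * 6 * 8, hidden_dim=256,
                 dropout_p=0.5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(encoder_feat_dim, hidden_dim), nn.ReLU(),
            nn.Dropout(dropout_p),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Dropout(dropout_p),
            nn.Linear(hidden_dim, 1),   # average reprojection error estimate
        )

    def forward(self, encoder_features):
        # encoder_features: (B, C, H, W) feature maps from the flow backbone.
        return self.mlp(encoder_features.flatten(start_dim=1))
```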

A.2 Training and Loss Function

We train on a loss function that combines two terms: the error between the true and estimated average reprojection error, and the binary cross-entropy loss of the foreground / background segmentation, averaged over all pixels. Furthermore, the target for the average reprojection error is saturated at a fixed number of pixels, ensuring that samples with very large perturbations do not disturb the training, and letting the network focus on achieving high precision within the range of reasonable perturbations.
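A sketch of this combined loss is given below; the regression penalty (L1 here), the relative weighting of the two terms, and the saturation threshold are placeholder choices rather than the exact settings.

```python
import torch
import torch.nn.functional as F


def ppc_loss(err_pred, err_target, seg_logits, seg_target,
             saturation_px=50.0, seg_weight=1.0):
    """Combined training loss (illustrative weighting and saturation values).

    err_pred:   (B,)      predicted average reprojection error
    err_target: (B,)      ground-truth average reprojection error, Eq. (1)
    seg_logits: (B,1,H,W) foreground/background logits (auxiliary branch)
    seg_target: (B,1,H,W) binary foreground mask as floats in {0, 1}
    """
    # Saturate the regression target so that very large perturbations do not
    # dominate training.
    err_target = torch.clamp(err_target, max=saturation_px)
    reg_loss = F.l1_loss(err_pred, err_target)

    # Binary cross-entropy for the auxiliary segmentation branch,
    # averaged over all pixels.
    seg_loss = F.binary_cross_entropy_with_logits(seg_logits, seg_target)

    return reg_loss + seg_weight * seg_loss
```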

One model was trained for each object, for a fixed number of epochs with a fixed number of batches sampled per epoch. The learning rate was initialized to a constant value and multiplied by a decay factor at regular epoch intervals. A fixed batch size was used, and regularization was applied through weight decay.

A.3 Pose Proposal Sampling

For training, we generate pose proposals by perturbing the ground-truth pose in three different ways: (1) with a certain probability, a rotation around a random axis through the object centre, whose magnitude is normally distributed; (2) with a certain probability, a random lateral translation, normally distributed with a standard deviation proportional to the object diameter; (3) with a certain probability, a relative depth perturbation, sampled from a log-normal distribution. This procedure and its settings were found to work well experimentally.
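The sampling procedure can be sketched as follows; all probabilities and distribution parameters are placeholder values, the lateral perturbation is applied in the camera x/y directions as one possible interpretation, and so3_exp refers to the rotation helper from the parameterization sketch in Section 3.3.

```python
import numpy as np


def sample_pose_proposal(R_gt, t_gt, diameter, rng=None,
                         p_rot=0.9, p_lat=0.9, p_depth=0.9,
                         rot_std_deg=15.0, lat_std_rel=0.1,
                         depth_log_std=0.1):
    """Perturb the ground-truth pose to generate a training pose proposal.

    All probabilities and standard deviations are illustrative placeholders.
    Uses so3_exp from the parameterization sketch above.
    """
    if rng is None:
        rng = np.random.default_rng()
    R, t = R_gt.copy(), t_gt.copy()

    # (1) Rotation around a random axis through the object centre.
    if rng.random() < p_rot:
        axis = rng.normal(size=3)
        axis /= np.linalg.norm(axis)
        angle = np.deg2rad(rng.normal(0.0, rot_std_deg))
        R = so3_exp(angle * axis) @ R

    # (2) Lateral translation, scaled by the object diameter.
    if rng.random() < p_lat:
        t[:2] += rng.normal(0.0, lat_std_rel * diameter, size=2)

    # (3) Relative depth perturbation, log-normally distributed.
    if rng.random() < p_depth:
        t[2] *= np.exp(rng.normal(0.0, depth_log_std))

    return R, t
```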

Figure 3: Synthetically rendered training examples of observed images. Occlusion is simulated by rendering additional objects in front of the object of interest, or alternatively the corresponding region is replaced with background, effectively making the occluder transparent.

A.4 Rendering Synthetic Training Data

In addition to real annotated training images, we augment the training set by rendering synthetic observed images, illustrated in Figure 3. As for the pose proposal renderings, rendering is done using OpenGL. Phong shading is applied, and we noticed a performance boost from taking specular effects into account. In the spirit of Domain Randomization [26], we sample variations in light source position as well as shading parameters such as the ambient / diffuse / specular weights, and the whiteness / shininess parameters of the specular effects. No perturbations are applied to the albedo.

Random images from Pascal VOC2012 [27] were used as backgrounds, and Gaussian blur was applied along the border in order to blend foreground and background and reduce overfitting to border artifacts, as proposed by [28]. Gaussian blur was also applied to the whole object of interest, as advised by [29].

Furthermore, occluding objects of other object categories are sometimes rendered in front of the object of interest. A minimum visible region is however ensured; otherwise the occluders are resampled. In order to prevent overfitting to the specific objects used for occlusion, occluded regions are replaced with background with a certain probability.

Finally, in the cases where we trained only on synthetic data, random noise in HSV space was applied to the observed images.

Appendix B Further Results

B.1 Backbone Comparison

As a first experiment, we evaluated the performance of different backbones on Occlusion LINEMOD. In addition to the FlowNet 2.0 backbone [25], a siamese network based on ResNet-18 [30], following the encoder of Zakharov et al. [4], was re-implemented for comparison. As can be seen in Table 4, the FlowNet 2.0 model outperforms the alternative, giving further evidence for the conclusion of Li et al. [3] that a feature extractor trained for optical flow is useful also for this task.

ResNet-18 [30]
FlowNet 2.0 [25]
add(-s)-d
reproj-s-px
5cm/-s
Table 4: Comparison of our method on Occlusion LINEMOD for different backbones.

B.2 Illustration of Refinement Iterates

Figure 4 shows how our method gradually refines the pose for a few example frames of the Occlusion LINEMOD dataset, illustrated by the image patches of a few iterations. Despite the sub-optimal pose proposals from PVNet [18], the poses are accurately recovered.

Figure 4: Image patches during pose refinement iterations (initial observed, initial rendered, iteration 10 rendered, final rendered, final observed), for a few example frames of the Occlusion LINEMOD dataset.

B.3 Detailed Pose Refinement Results

Here we present detailed (per-object) pose refinement results and corresponding comparison with other methods.

Tables 5, 6 and 7 show results on Occlusion LINEMOD for the add(-s)-d, reproj-s-px and 5cm/-s metrics, respectively. The results of the corresponding experiments on synthetic data are reported in Tables 8, 9 and 10.

Similarly, results on LINEMOD are reported in Tables 11, 12 and 13, for the add(-s)-d, reproj-px and 5cm/ metrics, respectively.

The symmetric objects eggbox and glue are marked in the tables, and for them add(-s)-d refers to add-s-d; the reproj-s-px and 5cm/-s metrics also take their ambiguities under 180-degree rotations around the "up"-axis into account.

Oberweger et al. [13]
PVNet [18]
PoseCNN [6]
+ DeepIM [3]
PVNet [18]
+ PPC (Ours)
ape
can
cat
driller
duck
eggbox
glue
holepuncher
Mean
Table 5: Results on Occlusion LINEMOD according to the add(-s)-d metric.
Oberweger et al. [13]
PVNet [18]
PoseCNN [6]
+ DeepIM [3]
PVNet [18]
+ PPC (Ours)
ape
can
cat
driller
duck
eggbox
glue
holepuncher
Mean
Table 6: Results on Occlusion LINEMOD according to the reproj-s-px metric. Note that [13] reports results according to reproj-px.
PVNet [18]
PoseCNN [6]
+ DeepIM [3]
PVNet [18]
+ PPC (Ours)
ape
can
cat
driller
duck
eggbox
glue
holepuncher
Mean
Table 7: Results on Occlusion LINEMOD according to the 5cm/-s metric. No results are reported by Oberweger et al. [13] on this metric.
CDPN-synth [16]
CDPN-synth [16]
+ PPC-synth (Ours)
ape
can
cat
driller
duck
eggbox
glue
holepuncher
Mean
Table 8: Synthetic results on Occlusion LINEMOD according to the add(-s)-d metric.
CDPN-synth [16]
CDPN-synth [16]
+ PPC-synth (Ours)
ape
can
cat
driller
duck
eggbox
glue
holepuncher
Mean
Table 9: Synthetic results on Occlusion LINEMOD according to the reproj-s-px metric.
CDPN-synth [16]
CDPN-synth [16]
+ PPC-synth (Ours)
ape
can
cat
driller
duck
eggbox
glue
holepuncher
Mean
Table 10: Synthetic results on Occlusion LINEMOD according to the 5cm/-s metric.
PoseCNN [6]
PoseCNN [6]
+ DeepIM [3]
PoseCNN [6]
+ PPC (Ours)
ape
benchvise
camera
can
cat
driller
duck
eggbox
glue
holepuncher
iron
lamp
phone
Mean
Table 11: Results on LINEMOD according to the add(-s)-d metric.
PoseCNN [6]
PoseCNN [6]
+ DeepIM [3]
PoseCNN [6]
+ PPC (Ours)
ape
benchvise
camera
can
cat
driller
duck
eggbox
glue
holepuncher
iron
lamp
phone
Mean
Table 12: Results on LINEMOD according to the reproj-px metric. Note that [3] reports results according to reproj-s-px.
PoseCNN [6]
PoseCNN [6]
+ DeepIM [3]
PoseCNN [6]
+ PPC (Ours)
ape
benchvise
camera
can
cat
driller
duck
eggbox
glue
holepuncher
iron
lamp
phone
Mean
Table 13: Results on LINEMOD according to the 5cm/ metric. Note that [3] reports results according to 5cm/-s.

Appendix C Additional Notes

C.1 Negative Depth Correction of Pose Proposals

We observed that the pose proposals from PVNet [18] sometimes have negative depth. In this case we switched the sign of the object center position and rotated the object 180 degrees around the principal axis of the camera, in order to yield a feasible estimate with a similar projection (the projection is identical for points on the plane that passes through the object center and is parallel to the principal plane of the camera). This correction is applied both when reporting the results of [18] and when reporting the results of our refinement.
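The correction amounts to negating the translation and composing the rotation with a 180-degree rotation around the camera's z-axis, as in the following sketch.

```python
import numpy as np


def correct_negative_depth(R, t):
    """If the proposed object centre lies behind the camera (negative depth),
    flip the sign of the translation and rotate the object 180 degrees around
    the camera principal (z) axis. Points on the plane through the object
    centre parallel to the image plane keep the same projection."""
    if t[2] >= 0:
        return R, t
    Rz_pi = np.diag([-1.0, -1.0, 1.0])   # 180-degree rotation around z
    return Rz_pi @ R, -t
```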

References

  • [1] Eric Brachmann, Alexander Krull, Frank Michel, Stefan Gumhold, Jamie Shotton, and Carsten Rother. Learning 6D object pose estimation using 3D object coordinates. In The European Conference on Computer Vision (ECCV), September 2014.
  • [2] Fabian Manhardt, Wadim Kehl, Nassir Navab, and Federico Tombari. Deep model-based 6D pose refinement in RGB. In The European Conference on Computer Vision (ECCV), September 2018.
  • [3] Yi Li, Gu Wang, Xiangyang Ji, Yu Xiang, and Dieter Fox. DeepIM: Deep iterative matching for 6D pose estimation. In European Conference Computer Vision (ECCV), 2018.
  • [4] Sergey Zakharov, Ivan Shugurov, and Slobodan Ilic. DPOD: 6D pose object detector and refiner. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [5] Alex Kendall, Matthew Grimes, and Roberto Cipolla. PoseNet: A convolutional network for real-time 6-DoF camera relocalization. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [6] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. Robotics: Science and Systems (RSS), 2018.
  • [7] Thanh-Toan Do, Ming Cai, Trung Pham, and Ian Reid. Deep-6DPose: Recovering 6D object pose from a single RGB image, 2018.
  • [8] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004. Second Edition.
  • [9] Mahdi Rad and Vincent Lepetit. BB8: A scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [10] Wadim Kehl, Fabian Manhardt, Federico Tombari, Slobodan Ilic, and Nassir Navab. SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [11] Bugra Tekin, Sudipta N. Sinha, and Pascal Fua. Real-time seamless single shot 6D object pose prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [12] Georgios Pavlakos, Xiaowei Zhou, Aaron Chan, Konstantinos G. Derpanis, and Kostas Daniilidis. 6-DoF object pose from semantic keypoints. In 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, May 2017.
  • [13] Markus Oberweger, Mahdi Rad, and Vincent Lepetit. Making deep heatmaps robust to partial occlusions for 3D object pose estimation. In The European Conference on Computer Vision (ECCV), September 2018.
  • [14] Omid Hosseini Jafari, Siva Karthik Mustikovela, Karl Pertsch, Eric Brachmann, and Carsten Rother. iPose: instance-aware 6D pose estimation of partly occluded objects. In ACCV, 2018.
  • [15] Kiru Park, Timothy Patten, and Markus Vincze. Pix2Pose: Pixel-wise coordinate regression of objects for 6D pose estimation. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [16] Zhigang Li, Gu Wang, and Xiangyang Ji. CDPN: Coordinates-based disentangled pose network for real-time RGB-based 6-DoF object pose estimation. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [17] Yinlin Hu, Joachim Hugonot, Pascal Fua, and Mathieu Salzmann. Segmentation-driven 6D object pose estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [18] Sida Peng, Yuan Liu, Qixing Huang, Xiaowei Zhou, and Hujun Bao. PVNet: Pixel-wise voting network for 6DoF pose estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [19] Henning Tjaden, Ulrich Schwanecke, and Elmar Schömer. Real-time monocular pose estimation of 3D objects using temporally consistent local color histograms. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [20] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. Neural 3d mesh renderer. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • [22] Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Computer Vision – ACCV 2012, pages 548–562. Springer Berlin Heidelberg, 2013.
  • [23] Tomáš Hodaň, Frank Michel, Eric Brachmann, Wadim Kehl, Anders Glent Buch, Dirk Kraft, Bertram Drost, Joel Vidal, Stephan Ihrke, Xenophon Zabulis, Caner Sahin, Fabian Manhardt, Federico Tombari, Tae-Kyun Kim, Jiří Matas, and Carsten Rother. BOP: Benchmark for 6D object pose estimation. European Conference on Computer Vision (ECCV), 2018.
  • [24] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [25] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [26] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017.
  • [27] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
  • [28] Debidatta Dwibedi, Ishan Misra, and Martial Hebert. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [29] Stefan Hinterstoisser, Vincent Lepetit, Paul Wohlhart, and Kurt Konolige. On pre-trained image features and synthetic images for deep learning. In The European Conference on Computer Vision (ECCV) Workshops, September 2018.
  • [30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.