We present a novel approach for model-based 6D pose refinement in color data. Building on the established idea of contour-based pose tracking, we teach a deep neural network to predict a translational and rotational update. At the core, we propose a new visual loss that drives the pose update by aligning object contours, thus avoiding the definition of any explicit appearance model. In contrast to previous work, our method is correspondence-free and segmentation-free, can handle occlusion, and is agnostic to geometrical symmetry as well as visual ambiguities. Additionally, we observe a strong robustness towards rough initialization. The approach can run in real-time and produces pose accuracies that come close to 3D ICP without the need for depth data. Furthermore, our networks are trained from purely synthetic data and will be published together with the refinement code to ensure reproducibility.
The problem of tracking CAD models in images is frequently encountered in contexts such as robotics, augmented reality (AR) and medical procedures. Usually, tracking has to be carried out in the full 6D pose, i.e. one seeks to retrieve both the 3D metric translation and the 3D rotation of the object in each frame. Another typical scenario is pose refinement, where an object detector provides a rough 6D pose estimate, which has to be corrected in order to provide a better fit (Figure 1). The usual difficulties that arise include viewpoint ambiguities, occlusions, illumination changes and differences in appearance between the model and the object in the scene. Furthermore, for tracking applications the method should also be fast enough to cover large inter-frame motions.
Most related work based on RGB data can be roughly divided into sparse and region-based methods. The former try to establish local correspondences between frames [40, 23] and work well for textured objects, whereas the latter exploit more holistic information about the object such as shape, contour or color [27, 8, 37, 38] and are usually better suited for texture-less objects. It is worth mentioning that mixtures of the two sets of methods have been proposed as well [30, 6, 31, 24]. Recently, methods that use only depth [34] or both modalities [21, 18, 10] have shown that depth can make tracking more robust by providing more clues about occlusion and scale.
Figure 1: (a) Input image. (b) Initial pose hypotheses. (c) Poses after 10 iterations.
This work aims to explore how RGB information alone can be sufficient to perform visual tasks such as 3D tracking and 6-Degree-of-Freedom (6DoF) pose refinement by means of a Convolutional Neural Network (CNN). While this has already been proposed for camera pose and motion estimation [19, 43, 41, 39], it has not been well-studied for the problem at hand.
As a major contribution we provide a differentiable formulation of a new visual loss that aligns object contours and implicitly optimizes for metric translation and rotation. While our optimization is inspired by region-based approaches, we can track objects of any texture or shape since we do not need to model global [27, 37, 18] or local appearance [11, 38]. Instead, we show that we can do away with these hand-crafted approaches by letting the network learn the object appearance implicitly. We teach the CNN to align contours between synthetic object renderings and scene images under changing illumination and occlusions and show that our approach can deal with a variety of shapes and textures. Additionally, our method can deal with geometrical symmetries and visual ambiguities without manual tweaking and is able to recover correct poses from very rough initializations.
Notably, our formulation is parameter-free and avoids typical pitfalls of hand-crafted tracking or refinement methods (e.g. via segmentation or correspondences + RANSAC) that require tedious tuning to work well in practice. Furthermore, as with depth-based approaches such as ICP, we are robust to occlusion and produce results that come close to RGB-D methods without the need for depth data, which makes our method well suited to AR, medical and robotics applications.
Since the field of tracking and pose refinement is vast, we focus here only on works that deal with CAD models in RGB data. Early methods in this field used either 2D-3D correspondences [29, 30] or 3D edges [9, 35, 32] and fit the model in an ICP fashion with iterative, projective update steps. Subsequent methods in this direction managed to obtain improved performance [6, 31]. Additionally, other works focused on tracking the contour densely via level-sets [3, 8].
Based on these works, [27] presented a new approach that follows the projected model contours to estimate the 6D pose update. In a follow-up work [26], the authors extended their method to simultaneously track and reconstruct a 3D object on a mobile phone in real-time. The authors of [37] improved the convergence behavior with a new optimization scheme and presented a real-time implementation on a GPU. Subsequently, [38] showed how to improve the color segmentation by using local color histograms over time. Orthogonally, the work [18] approximates the model pose space to avoid GPU computations and enables real-time performance on a single CPU core. All these approaches share the property that they rely on hand-crafted segmentation methods that can fail in the case of sudden appearance changes or occlusion. We instead want to avoid hand-crafted appearance descriptions entirely.
Another set of works tries to combine learning with simultaneous detection and pose estimation in RGB. The method presented in [17] couples the SSD paradigm [22] with pose estimation to produce 6D pose pools per instance which are then refined with edge-based ICP. In contrast, the approach from [5] uses auto-context Random Forests to regress object coordinates in the scene that are used to estimate poses. In [28] a method is presented that instead regresses the projected 3D bounding box and recovers the pose from these 2D-3D correspondences, whereas the authors of [25] infer keypoint heatmaps that are then used for 6D pose computation. Similarly, the 3D Interpreter Network [42] infers heatmaps for categories and regresses projection and deformation to align synthetic with real imagery. In the work [10], a deep learning approach is used to track models in RGB-D data. Their work is similar in spirit, but we differ in multiple ways including data generation, energy formulation and their use of RGB-D data. In particular, we show that a naive formulation of pose regression does not work in the case of symmetry, which is often present in man-made objects.
We also find common ground with Spatial Transformer Networks in 2D [16] and especially 3D [2], where the employed network architecture contains a submodule to transform the 2D/3D input via a regressed affine transformation on a discrete lattice. Our network instead regresses a rigid body motion on a set of continuous 3D points to minimize the visual error.
In this section we explain our approach to train a CNN to regress a 6D pose refinement from RGB information alone. We design the problem in such a way that we supply two color patches (one cropped from the scene image and one from the rendered hypothesis) to the network in order to infer a translational and rotational update. In Figure 2 we depict our pipeline and show a typical scenario where we have a 6D hypothesis (coming from a detector or tracker) that is not correctly aligned. We want to estimate a refinement such that eventually the updated hypothesis overlaps perfectly with the real object.
We first discuss our patch extraction strategy. Provided a CAD model and a 6D pose estimate in camera space, we create a rendering and compute the center of the associated bounding box of the hypothesis, around which we subsequently extract the scene patch and the hypothesis patch. Since different objects have varying sizes and shapes, it is important to adapt the cropping size to the spatial properties of the specific object. The most straightforward method would be to simply crop both patches with respect to a tight 2D bounding box of the rendered mask. However, when employing such metric crops, the network loses the ability to robustly predict an update along the Z-axis: since each crop would almost entirely fill out the input patch, no estimate of the difference in depth can be drawn. For this reason, we explicitly calculate the spatial extent in pixels at a minimum metric distance (with some added padding) and use this as a fixed-size 'window' into our scene. In particular, prior to training, we render the object from various viewpoints, compute their bounding boxes, and take the maximum width or height of all produced bounding boxes.
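The window-size rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the relative `padding` value and the representation of bounding boxes as `(width, height)` tuples are all assumptions made for this example.

```python
def window_size(bboxes, padding=0.1):
    """Side length in pixels of the fixed square crop window.

    bboxes:  (width, height) pairs in pixels, one per rendered viewpoint,
             assumed to be rendered at the minimum metric distance.
    padding: relative margin added on top (illustrative value only).
    """
    # Largest extent over all viewpoints, regardless of orientation.
    largest = max(max(w, h) for w, h in bboxes)
    return int(round(largest * (1.0 + padding)))


def crop_box(center, size):
    """Square box (x0, y0, x1, y1) of side `size` around a pixel center."""
    cx, cy = center
    half = size // 2
    x0, y0 = int(round(cx)) - half, int(round(cy)) - half
    return x0, y0, x0 + size, y0 + size
```

Because the window has the same pixel size for every crop, an object that is further away simply occupies less of the patch, which is the cue the network needs for an update along the Z-axis.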
To create training data we randomly sample a ground truth pose of the object in camera coordinates and render the object with that pose onto a random background to create a scene image. To learn pose refinement, we perturb the true pose to get a noisy version and render a hypothesis image. Given those two images, we cut out the scene and hypothesis patches with the strategy mentioned above.
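As a rough sketch of the perturbation step, the snippet below draws a random rotation and a random translation offset and applies them to a ground truth pose. The sampling distributions, the parameter names and the choice to left-multiply the noise rotation are assumptions for illustration; the text does not specify them.

```python
import math
import random


def quat_from_axis_angle(axis, angle_rad):
    """Unit quaternion (w, x, y, z) for a rotation about `axis`."""
    ax, ay, az = axis
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    s = math.sin(angle_rad / 2.0) / norm
    return (math.cos(angle_rad / 2.0), ax * s, ay * s, az * s)


def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def perturb_pose(q_gt, t_gt, max_angle_deg, max_shift, rng):
    """Return a noisy hypothesis pose (q_hyp, t_hyp) around the true pose."""
    axis = [rng.gauss(0.0, 1.0) for _ in range(3)]  # random rotation axis
    angle = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    q_hyp = quat_mul(quat_from_axis_angle(axis, angle), q_gt)
    t_hyp = tuple(t + rng.uniform(-max_shift, max_shift) for t in t_gt)
    return q_hyp, t_hyp
```

Passing an explicit `random.Random` instance keeps the generated training samples reproducible.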
Provided these patches, we now want to infer a separate correction of the perturbed pose such that
(1) 
Due to the difficulty of optimizing in SO(3) we parametrize via unit quaternions to define a regression problem, i.e. similar to what [20] proposed for camera localization or [10] for model pose tracking:
(2) 
In essence, this energy weighs the numerical error in rotation against the one in translation by means of a weighting hyperparameter and can be optimized correctly when solutions are unique (as is the case, e.g., for camera pose regression). Unfortunately, the above formulation only works for injective relations where an input image pair always gets mapped to the same transformation. In the case of one-to-many mappings, i.e. when an image pair can have multiple correct solutions, the optimization does not converge since it is pulled in multiple directions and regresses the average instead. In the context of our task, visual ambiguity is common for most man-made objects because they are either symmetric or share the same appearance from multiple viewpoints. For these objects there is a large (sometimes infinite) set of refinement solutions that yield the same visual result. In order to regress rotation and translation under ambiguity, we therefore propose an alternative formulation.
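The averaging problem can be made concrete with a small numerical sketch. The loss below follows the general shape described above (a quaternion error plus a weighted translation error, in the style of [20]); its exact form, the name `gamma` and the placement of the weight are assumptions. The example considers an object with a two-fold symmetry for which a rotation by +90° and by -90° about the Z-axis are equally valid corrections.

```python
import math


def regression_loss(q_pred, t_pred, q_gt, t_gt, gamma=1.0):
    """Direct pose regression loss sketch (assumed form, not the paper's Eq. 2)."""
    norm = math.sqrt(sum(c * c for c in q_pred))
    q_unit = [c / norm for c in q_pred]  # normalize the predicted quaternion
    err_q = math.sqrt(sum((a - b) ** 2 for a, b in zip(q_unit, q_gt)))
    err_t = math.sqrt(sum((a - b) ** 2 for a, b in zip(t_pred, t_gt)))
    return err_q + gamma * err_t


# Two equally valid targets for a symmetric object: +90 and -90 degrees about Z.
half = math.radians(90.0) / 2.0
q_plus = (math.cos(half), 0.0, 0.0, math.sin(half))
q_minus = (math.cos(half), 0.0, 0.0, -math.sin(half))

# A regressor pulled towards both targets ends up near their mean,
# which normalizes to the identity rotation and matches neither target.
q_mean = tuple((a + b) / 2.0 for a, b in zip(q_plus, q_minus))
zero_t = (0.0, 0.0, 0.0)
loss_at_mean = regression_loss(q_mean, zero_t, q_plus, zero_t)
```

The mean prediction corresponds to "do not rotate at all", so its loss against either valid target stays large even though both targets would have produced a visually perfect alignment.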
Instead of explicitly minimizing an ambiguous error in transformation, we strive to minimize an unambiguous error that measures similarity in appearance. We thus treat our search for the pose refinement parameters as a subproblem inside another proxy loss that optimizes for visual alignment. While there are multiple ways to define a similarity measure, we seek one that fulfills the following properties: 1) invariant to symmetric or indistinguishable object views, 2) robust to color deviation, illumination change and occlusion as well as 3) smooth and differentiable with respect to the pose.




To fulfill the first two properties we propose to align the object contours. Tracking the 6D pose of objects via projective contours has been presented before [18, 37, 27] but, to the best of our knowledge, has not so far been introduced in a deep learning framework. Contour tracking reduces the difficult problem of 3D geometric alignment to a simpler task of 2D silhouette matching by moving through a distance transform, avoiding explicit correspondence search. Furthermore, a physical contour is not affected by deviations in coloring or lighting, which makes it even more appealing for pure RGB methods. We refer to Figure 3 for a training example and the visualization of the contours we align.
Fulfilling smoothness and differentiability is more difficult. An optimization step for this energy requires rendering the object with the current pose hypothesis for contour extraction, estimating the similarity with the target contour and backpropagating the error gradient such that the refined hypothesis' projected contour is closer in the next iteration. Unfortunately, backpropagating through a rendering pipeline is non-trivial (due to, among others, z-buffering and rasterization). We therefore propose a novel formulation to drive the network optimization successfully through the ambiguous 6D solution space. We employ an idea, introduced in [18], that allows us to use an approximate contour for optimization without iterative rendering. When creating a training sample, we use the depth map of the rendering to compute a 3D point cloud in camera space and sample a sparse point set on the contour. The idea is then to transform these contour points with the current refinement estimate, followed by a projection into the scene. This mimics a rendering plus contour extraction at no cost and allows for backpropagation.
For a given training sample consisting of an input patch pair, a distance transform of the scene contour and the hypothesis contour points, we define the loss
(3) 
Here, the rotation is applied to each contour point via the regressed unit quaternion and its conjugate. With the formulation above we also free ourselves from any balancing issue between quaternion and translation magnitudes as in a standard regression formulation.
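The following sketch evaluates a contour loss of this kind in plain Python: 3D contour points are rotated by a unit quaternion, translated, projected with a pinhole model and looked up in a distance transform of the scene contour. It is a forward-pass illustration under several assumptions: a pinhole camera with intrinsics `fx, fy, cx, cy`, a rotation about the camera origin, a brute-force distance transform and a nearest-pixel lookup. A trainable version would need a differentiable (e.g. bilinear) lookup inside a deep learning framework.

```python
import math


def quat_rotate(q, v):
    """Rotate point v by unit quaternion q = (w, x, y, z), i.e. q * v * conj(q)."""
    w, x, y, z = q
    vx, vy, vz = v
    # t = 2 * cross(q_vec, v)
    tx = 2.0 * (y * vz - z * vy)
    ty = 2.0 * (z * vx - x * vz)
    tz = 2.0 * (x * vy - y * vx)
    # v' = v + w * t + cross(q_vec, t)
    return (vx + w * tx + (y * tz - z * ty),
            vy + w * ty + (z * tx - x * tz),
            vz + w * tz + (x * ty - y * tx))


def distance_transform(contour_pixels, width, height):
    """Brute-force Euclidean distance of every pixel to the nearest contour pixel."""
    return [[min(math.hypot(u - cu, v - cv) for cu, cv in contour_pixels)
             for u in range(width)] for v in range(height)]


def contour_loss(points_3d, q, t, dist_map, fx, fy, cx, cy):
    """Mean distance-transform value at the projected, transformed contour points."""
    height, width = len(dist_map), len(dist_map[0])
    total = 0.0
    for p in points_3d:
        X, Y, Z = (r + s for r, s in zip(quat_rotate(q, p), t))
        u = int(round(fx * X / Z + cx))  # pinhole projection
        v = int(round(fy * Y / Z + cy))
        u = min(max(u, 0), width - 1)    # clamp to the image
        v = min(max(v, 0), height - 1)
        total += dist_map[v][u]
    return total / len(points_3d)
```

The loss is zero exactly when every projected point lands on a scene contour pixel, regardless of which of several visually equivalent poses produced that projection.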
Minimizing the above loss with a gradient descent step forces a step towards the 0-level set of the distance transform. We basically tune the network weights to rotate and translate the object in 6D to maximize the projected contour overlap. While this works well in practice, we have observed that for certain objects and stronger pose perturbations the optimization can get stuck in local minima. This occurs when our loss drives the contour points into a configuration where the distance transform allows them to settle in local valleys. To remedy this problem we introduce a bi-directional loss formulation that simultaneously aligns the contours of hypothesis and scene onto each other, coupled and constrained by the same pose update. We thus have an additional term that runs in the opposite direction:
(4) 
This final loss not only alleviates the locality problem but has also been shown to lead to faster training overall. We therefore chose this energy for all experiments.
We give a schematic overview of our network structure in Figure 2 and provide more details here. In order to ensure fast inference, our network follows a fully-convolutional design. The network is fed with two input patches representing the cropped scene image and the cropped render image. Both patches run in separate paths through the first levels of an InceptionV4 [33] instance to extract low-level features. Thereafter we concatenate the two feature tensors, downsample by employing max-pooling as well as a strided convolution, and concatenate the results again. After two Inception-A blocks we branch off into two separate paths for the regression of rotation and translation. In each we employ two more Inception-A blocks before downsampling by another strided convolution. The resulting tensors are then convolved with either a kernel that regresses a 4D quaternion or a kernel that predicts a 3D translation update vector.
Initial experiments showed clearly that training the network from scratch made it impossible to bridge the domain gap between synthetic and real images. Similarly to [17, 13] we found that the network focused on specific appearance details of the rendered CAD models and the performance on real imagery collapsed drastically. Synthetic images usually possess very sharp edges and clear corners. Since the first layers learn low-level features, they overfit quickly to this perfect rendered world during training. We therefore copied the first five convolutional blocks from a pre-trained model and froze their parameters. We show the improvements in terms of generalization to real data in the supplement.
Further, we initialize the final regression layers such that the bias equals the identity quaternion and zero translation, whereas the weights are initialized with small Gaussian noise. This ensures that we start refinement from a neutral pose, which is crucial for the evaluation of the projective visual loss.
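A minimal sketch of such an initialization is shown below. It treats each regression head as a plain linear map on a feature vector, which is an assumption made for brevity (the network described above uses convolution kernels), and the noise level `sigma` is an illustrative placeholder rather than the value used in the experiments.

```python
import random


def init_regression_head(n_features, bias, sigma, rng):
    """Weights ~ N(0, sigma^2), bias set to the desired neutral output."""
    weights = [[rng.gauss(0.0, sigma) for _ in range(n_features)]
               for _ in bias]
    return weights, list(bias)


def apply_head(weights, bias, features):
    """Linear map: one output per weight row, plus bias."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


rng = random.Random(0)
# Rotation head starts at the identity quaternion (w, x, y, z) = (1, 0, 0, 0).
w_rot, b_rot = init_regression_head(8, (1.0, 0.0, 0.0, 0.0), 1e-4, rng)
# Translation head starts at zero translation.
w_trans, b_trans = init_regression_head(8, (0.0, 0.0, 0.0), 1e-4, rng)
```

With near-zero weights the first predictions are dominated by the bias, so the untrained network outputs approximately "no update" and the projected contour points start at the hypothesis contour.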
While our approach produces very good refinements in a single shot we decided to also implement an iterative version where we run the pose refinement multiple times until the regressed update falls under a threshold.
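The iterative variant can be sketched as a simple loop around the network. Here `predict_update` is a hypothetical stand-in for the CNN, the way updates are composed (left-multiplied rotation, additive translation) is an assumption, and the default iteration cap and thresholds mirror the values reported in the refinement experiments (at most 5 iterations, 1.5° and 7.5mm).

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def update_magnitude(dq, dt):
    """Rotation angle in degrees and translation norm of a pose update."""
    w = min(1.0, abs(dq[0]) / math.sqrt(sum(c * c for c in dq)))
    return math.degrees(2.0 * math.acos(w)), math.sqrt(sum(c * c for c in dt))


def refine(q, t, predict_update, max_iters=5, rot_thresh_deg=1.5, trans_thresh=7.5):
    """Apply predicted updates until the last update is below both thresholds."""
    iters = 0
    while iters < max_iters:
        dq, dt = predict_update(q, t)
        q = quat_mul(dq, q)
        t = tuple(a + b for a, b in zip(t, dt))
        iters += 1
        angle, shift = update_magnitude(dq, dt)
        if angle < rot_thresh_deg and shift < trans_thresh:
            break
    return q, t, iters


def halfway_update(q, t, target=(0.0, 0.0, 0.0)):
    """Toy predictor: no rotation, move the translation halfway to a target."""
    return (1.0, 0.0, 0.0, 0.0), tuple(0.5 * (g - c) for g, c in zip(target, t))


q_final, t_final, n_steps = refine((1.0, 0.0, 0.0, 0.0), (0.0, 0.0, 100.0), halfway_update)
```

With the toy predictor the translation updates shrink as 50, 25, 12.5 and 6.25, so the loop stops after the fourth step, once the update drops below the translation threshold.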
We ran our method with TensorFlow 1.4 [1] on an i7-5820K@3.3GHz with an NVIDIA GTX 1080. For all experiments we ran the training with 100k iterations, a batch size of 16 and the ADAM optimizer. Furthermore, we used a fixed number of 3D contour points per view. Additionally, our method is real-time capable since one iteration requires approximately 25ms during testing.
To evaluate our method, we carried out experiments on three datasets, both synthetic and real, and show that our method can come close to RGB-D based approaches. In particular, the first dataset, referred to as 'Hinterstoisser', was introduced in [12] and consists of 15 sequences, each possessing approximately 1000 images with clutter and mild occlusion. Only 13 of these provide watertight CAD models and we therefore, like others before us, skip the other two sequences. The second one, which we refer to as 'Tejani', was proposed in [36] and consists of six mostly semi-symmetric, textured objects, each undergoing different levels of occlusion. In contrast to the first two real datasets, the third one, referred to as 'Choi' [7], consists of four synthetic tracking sequences.
In essence, we will first conduct some self-evaluation in which we illustrate our convergence properties with respect to different degrees of pose perturbation on real data. Then we show our method when applied to object tracking on 'Choi'. As a second application, we compare our approach to a variety of other state-of-the-art RGB and RGB-D methods by conducting experiments in pose refinement on 'Hinterstoisser', the 'Occlusion' dataset and 'Tejani'. Finally, we depict some failure cases and conclude with a qualitative category-level experiment.
We study the convergence behavior of our method by taking correct poses, applying a perturbation by a certain amount and measuring how well we can refine back to the original pose. To this end, we use the 'Hinterstoisser' dataset since it provides a lot of variety in terms of both colors and shapes. For each frame of a particular sequence we perturb the ground truth pose either by an angle or by a translation vector. In Figure 4 we illustrate our results for the 'ape' and the 'bvise' objects and kindly refer the reader to the supplement for all graphs. In particular, we report our results for increasing degrees of angular perturbation from 5° to 45° and for increasing translation perturbations from 0 to 1 relative to the object's diameter. We count a run as diverged if the refined rotation is above 45° in error or the refined translation error is larger than half of the object's diameter, and we employ 10 iterative steps to maximize our possible precision.
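The divergence criterion can be written down directly. The sketch below assumes poses are given as unit quaternions `(w, x, y, z)` plus translation vectors and measures the rotation error as the angle of the relative rotation; the function names are hypothetical.

```python
import math


def rotation_error_deg(q_a, q_b):
    """Angle in degrees of the relative rotation between two unit quaternions."""
    dot = abs(sum(a * b for a, b in zip(q_a, q_b)))  # abs handles q and -q
    return math.degrees(2.0 * math.acos(min(1.0, dot)))


def has_diverged(q_ref, t_ref, q_gt, t_gt, diameter,
                 max_rot_deg=45.0, max_trans_frac=0.5):
    """True if rotation error exceeds 45 degrees or translation error half a diameter."""
    rot_err = rotation_error_deg(q_ref, q_gt)
    trans_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(t_ref, t_gt)))
    return rot_err > max_rot_deg or trans_err > max_trans_frac * diameter
```

Taking the absolute value of the quaternion dot product matters because `q` and `-q` encode the same rotation.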
In general, our method can recover poses very robustly even under strong perturbations. Even for the extreme case of rotating the 'bvise' by 45° we can refine back to an error of less than 5° in more than 60% of all trials, and to an error of less than 10° in more than 80% of all runs. Additionally, our approach diverged in less than 1% of the trials. However, for the more difficult 'ape' object our numbers worsen. In particular, in almost 50% of the cases we were not able to rotate the object back to an error of less than 10°. Yet, this can be easily explained by the object's appearance. The 'ape' is a rather small object with poor texture and non-distinctive shape, which does not provide enough information to hook onto, whereas the 'bvise' is large and rich in appearance. It is noteworthy that the actual divergence behavior in rotation is similar for both and that the visual alignment for the 'ape' is often very good despite the error in pose.
The translation error correlates almost linearly between initial and final pose. We also observe an interesting tendency starting from perturbation levels at around 0.6, after which the results can be divided into two distinct sets: either the pose diverges or the error settles at a certain level. This implies that certain viewpoints are easy to align as long as they have a certain visual overlap to begin with, largely independent of how strongly we perturb. Other views instead become more difficult with higher perturbations and diverge from some point on.


Figure 5: (a) Errors on 'Choi' with respect to others. (b) Tracking quality compared to [37].
As a first use case we evaluated our method as a tracker on the 'Choi' benchmark [7]. This RGB-D dataset consists of four synthetic sequences and we present detailed numbers in Figure 5. Note that all other methods utilize depth information. We chose this dataset because it is very hard for RGB-only methods: it is poor in terms of color and the objects are of (semi-)symmetric nature. To provide an interesting comparison we also qualitatively evaluated against our tracker implementation of [37]. While their method is usually robust for texture-less objects, it diverges on 3 sequences, which we show and for which we provide reasoning in Figure 5 and in the supplementary material (the authors acknowledged our conclusions in correspondence). In essence, except for the 'Milk' sequence we can report very good results. The reason why we performed comparatively poorly on 'Milk' is that our method already treats it as a rather symmetric object. Thus, it sometimes rotates the object along its Y-axis, which has a negative impact on the overall numbers. In particular, while already being misaligned, the method still tries to completely fit the object into the scene and therefore rotates and translates the object slightly further. Referring to the remaining objects, we can easily outperform PCL's ICP for all objects and also Choi and Christensen [7] in most cases. Compared to Krull [21], which is a learned RGB-D approach, we perform better for some values and worse for others. Note that our translation error along the Z-axis is quite high. Since the difference in pixels is almost non-existent when the object is moved only a few millimeters, it is almost impossible to estimate the exact distance of the object without leveraging depth information. This has also been discussed in [15] and is especially true for CNNs due to pooling operations.
Method             | ape  | bvise | cam  | can  | cat  | driller | duck | box  | glue | holep | iron | lamp | phone | total
No Refinement      | 0.64 | 0.65  | 0.71 | 0.72 | 0.63 | 0.62    | 0.65 | 0.64 | 0.64 | 0.69  | 0.71 | 0.63 | 0.69  | 0.66
2D Edge-based ICP  | 0.73 | 0.67  | 0.73 | 0.76 | 0.68 | 0.67    | 0.72 | 0.73 | 0.72 | 0.71  | 0.74 | 0.67 | 0.70  | 0.71
3D Cloud-based ICP | 0.86 | 0.88  | 0.91 | 0.87 | 0.87 | 0.85    | 0.83 | 0.84 | 0.75 | 0.77  | 0.85 | 0.84 | 0.81  | 0.84
Ours               | 0.83 | 0.83  | 0.75 | 0.87 | 0.79 | 0.85    | 0.87 | 0.88 | 0.85 | 0.82  | 0.85 | 0.80 | 0.83  | 0.83
(a) Absolute pose errors on [12] and [4].  (b) VSS scores for each sequence of [36]. 
This set of experiments analyzes our performance in a detection scenario where an object detector provides rough 6D poses and the goal is to refine them. We decided to use the results from SSD-6D [17], an RGB-based detection method that outputs 2D detections with a pool of 6D pose estimates each. The authors publicly provide their trained networks and we use them to detect and create 6D pose estimates which we feed into our system. Tables 1, 2 (a) and (b) depict our results for the 'Hinterstoisser', 'Occlusion' and 'Tejani' datasets using different metrics. We ran at most 5 iterations of our method and stopped early if the last update was less than 1.5° and 7.5mm. Since our method is particularly strong at recovering from bad initializations, we employ the same RGB-verification strategy as SSD-6D. However, we apply it before conducting the refinement since, in contrast to them, we can also deal with imperfect initializations as long as they are not completely misaligned. We report our errors with the VSS metric (a variant of VSD from [14]) that calculates a visual 2D error as the pixel-wise overlap between the renderings of ground truth pose and estimated pose. Furthermore, to compare better to related work, we also use the ADD score [12] to measure a 3D metrical error as the average point cloud deviation between real pose and inferred pose when transformed into the scene. A pose is counted as correct if the deviation is less than a fixed fraction of the object diameter.
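Both metrics can be sketched compactly. The snippet below is an illustration under assumptions: poses are given as 3x3 rotation matrices (nested lists) plus translation vectors, the acceptance `fraction` is left as a parameter whose default of 0.1 is assumed here, and the visual overlap is computed as intersection over union of two binary render masks, which is one plausible reading of "pixel-wise overlap".

```python
import math


def transform(R, t, p):
    """Apply rotation matrix R (3x3 nested lists) and translation t to point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def add_score(points, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between model points under the two poses."""
    dists = []
    for p in points:
        a = transform(R_gt, t_gt, p)
        b = transform(R_est, t_est, p)
        dists.append(math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))))
    return sum(dists) / len(dists)


def is_correct(add, diameter, fraction=0.1):
    """Accept a pose if ADD is below a fraction of the diameter (default assumed)."""
    return add < fraction * diameter


def mask_overlap(mask_a, mask_b):
    """Intersection over union of two binary masks given as nested lists."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += bool(a and b)
            union += bool(a or b)
    return inter / union if union else 1.0
```

The two metrics can disagree: a small error along the viewing direction barely changes the rendered masks but contributes fully to ADD.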
Referring to 'Hinterstoisser' with the VSS metric, we can strongly improve the state-of-the-art for most objects. In particular, for the case of RGB only, we can report an average VSS score of 83%, which is an impressive improvement, and can thus successfully bridge the gap between RGB and RGB-D in terms of pose accuracy.
Except for the 'cam' and the 'cat' objects, our results are on par with or even better than SSD-6D + 3D refinement. ICP relies on good correspondences and robust outlier removal, which in turn requires very careful parameter tuning. Furthermore, ICP is often unstable for rougher initializations. In contrast, our method learns refinement end-to-end and is more robust since it adapts to the specific properties of the object during training. However, due to this, our method requires meshes of good quality. Hence, similar to SSD-6D, we especially have problems with the 'cam' object since the model appearance strongly differs from the real images, which hampers training. Also note that their 3D refinement strategy uses ICP for each pose in the pool, followed by a verification over depth normals to decide on the best pose. Our method instead uses a simple check over image gradients to pick the best.
With respect to the ADD metric we fall slightly behind the other state-of-the-art RGB methods [5, 28]. We got the 3D-ICP refined poses from the SSD-6D authors and analyzed the errors in more detail in Table 2(a). We see again that we have larger errors along the Z-axis, but smaller errors along X and Y. Unfortunately, the ADD metric penalizes this deviation overly strongly. Interestingly, [5, 28] have better scores and we attribute this to two facts. The datasets are annotated via ICP with 3D models against depth data. Unfortunately, inaccurate intrinsics and the sensor registration error between RGB and D lead to an inherent mismatch where the ICP 6D pose does not always align perfectly in RGB. Purely synthetic RGB methods like ours or [17] suffer from (1) a domain gap in terms of texture/shape and (2) the dilemma that better RGB performance can worsen results when comparing to that 'true' ICP pose. We suspect that [5, 28] can learn this registration error implicitly since they train on real RGB cut-outs with associated ICP pose information and thus avoid both problems. We often observe that our visually-perfect alignments in RGB fail the ADD criterion and we show examples in the supplement. Since our loss actually optimizes a form of VSS to maximize contour overlap, we can expect the ADD scores to go up only when perfect alignment in color equates to perfect alignment in depth.
Finally, referring to the 'Occlusion' dataset, we can report a strong improvement compared to the original numbers from SSD-6D, despite the presence of strong occlusion. In particular, while the rotational error decreased by approximately 8 degrees, the translational error dropped by 4mm along the 'X' and 'Y' axes and by 28mm along 'Z'. Thus, we can increase ADD from 6.2% up to 28.5%, which demonstrates that we can deal with strong occlusion in the scene.
For 'Tejani' we decided to show the improvement over networks trained with a standard regression loss (MSE). Additionally, we re-implemented the RGB tracker from [37] and were kindly provided with numbers from the authors of the RGB-D tracker from [18] (see Figure 6). Since the dataset mostly consists of objects with geometric symmetry, we do not measure absolute pose errors here but instead report our numbers with the VSS metric. The MSE-trained networks consistently underperform since the dataset models are of symmetric nature, which in turn leads to a large difference of 14% in comparison to our visual loss. This result stresses the importance of correct handling of symmetry during training. The RGB tracker was not able to refine well because the color segmentation was corrupted by either occlusions or imperfect initialization. The RGB-D tracker, which builds on the same idea, performed better because it uses the additional depth channel for segmentation and optimization.
We were curious to find out whether our approach can generalize beyond a specific CAD model, given that many objects from the same category share similar appearance and shape properties. To this end, we conducted a final qualitative experiment (see Figure 7) where we collected a total of eight CAD models of cups, mugs and a bowl and trained simultaneously on all. During testing we then used this network to track new, unseen models from the same category. We were surprised to see that the approach has indeed learned to metrically track previously unseen but nonetheless similar structures. While the poses are not as accurate as for the single-instance case, it seems that one can indeed learn the projective relation of structure and how it changes under 6D motion, provided that at least the projection functions (i.e. camera intrinsics) are constant. We show the full sequence in the supplementary material.
Figure 8 illustrates two known failure cases where the left image of each pair represents initialization and the right image the refined result. Although we train with occlusion, certain occurrences can worsen our refinement nonetheless. While two 'milk' instances were refined well despite occlusion, the left 'milk' instance could not be recovered correctly. The network assumes the object to end at the yellow pen and only maximizes the remaining pixel-wise overlap. Besides occlusion, objects of similar color and shape can in rare cases lead to confusion. As shown in the right pair, the network mistakenly assumed the stapler, instead of the cup, to be the real object of interest.
We believe we have presented a new approach towards 6D model tracking in RGB with the help of deep learning, and we demonstrated the power of our approach on multiple datasets for the scenarios of pose refinement and instance/category tracking. Future work will include investigating generalization to other domains, e.g. the suitability for visual odometry.
Acknowledgments We would like to thank Toyota Motor Corporation for funding and supporting this work.
[1] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems. In: OSDI (2016), http://download.tensorflow.org/paper/whitepaper2015.pdf