1 Introduction
Tracking objects in image sequences is a relevant problem in computer vision with significant applications in several fields, such as robotics, augmented reality, medical navigation and surveillance. For most of these applications, object tracking has to be carried out in 3D, i.e. an algorithm has to retrieve the full 6D pose of each model in every frame. This is quite challenging since objects can be ambiguous in their pose and can undergo occlusions as well as appearance changes. Furthermore, trackers must also be fast enough to cover larger inter-frame motions.
In the case of 3D object tracking from color images, the related work can be roughly divided into sparse methods that try to establish and track local correspondences between frames [30, 16], and region-based methods that exploit more holistic information about the object such as shape, contour or color [18, 5], although mixtures of both do exist [22, 2, 23]. While both directions have their respective advantages and disadvantages, the latter performs better for textureless objects, which is our focus here. One popular methodology for textureless object tracking relies on the idea of aligning the projected object contours to a segmentation in each frame. While initially shown for arbitrary shapes [4, 1], more recent works put emphasis on tracking 3D colored models [5, 18, 31, 29].
With the advent of commodity RGB-D sensors, these methods have been further extended to depth images for simultaneous tracking and reconstruction [20, 19]. Indeed, exploiting RGB-D is beneficial since image-based contour information and depth maps are complementary cues, one being focused on object borders, the other on object-internal regions. This has been exploited for 3D object detection and tracking [15, 9, 12], as well as to improve camera tracking in planar indoor environments [32].
From a computational perspective, several state-of-the-art trackers leverage the GPU for real-time performance [20, 19, 29]. Nevertheless, there is a strong interest in decreasing the computational burden and generally avoiding GPU usage, motivated by the fact that many relevant applications require trackers to be lightweight [11, 17].
Taking this all into consideration, we propose a framework that allows accurate tracking of multiple 3D models in color and depth. Unlike related works [18, 29], our method is lightweight, both in computation (requiring only one CPU core) and in memory footprint. To achieve this, we propose to pre-render a given target 3D model from various viewpoints and extract occluding contour and interior information in an offline step. This avoids time-consuming online renderings and consequently results in a fast tracking approach. Furthermore, we do not compute the terms of our objective function densely but introduce sparse approximations, which give a tremendous performance boost and allow real-time tracking of multiple instances. While the proposed contour-based tracking works well in RGB images, in the case of available depth information, we propose two additions: firstly, we make the color-based segmentation more robust by incorporating cloud information and secondly, we define a new tracking framework where a novel plane-to-point error on cloud data and a contour error are simultaneously steering the pose alignment.

As a foundation of our work, we propose to pre-render the model view space and extract contour and interior information in an offline step to avoid online rendering, making our method a pure CPU-based approach.

We evaluate all terms sparsely instead of densely, which gives a tremendous performance boost.

Given RGB-D data, we show how to improve contour-based tracking by incorporating cloud information into the color contour estimation. Additionally, we present a new joint tracking that incorporates a novel plane-to-point error and a contour error, i.e. color and depth points are simultaneously steering the pose alignment.
Therefore, our method can deal with challenges typically encountered in tracking as depicted in Figure 1. In the results section, we evaluate our approach both quantitatively and qualitatively and compare it to related approaches, reporting better accuracy at higher speeds.
2 Related work
We confine ourselves to the field of 3D model tracking in color and depth. Earlier works in this field employ either 2D-3D correspondences [21, 22] or 3D edges [6, 28, 24] and fit the model in an ICP fashion, i.e. without explicitly computing a contour. While successive methods along this direction managed to obtain improved performance [2, 23], another set of works focused solely on densely tracking the contour by evolving a level-set function [1, 5]. In particular, Bibby et al. [1] aligned the current evolving 2D contour to a color segmentation, and demonstrated improved robustness when computing a posterior distribution in color space.
Based on this work, the first real-time contour tracker for 3D models was presented by Prisacariu et al. [18], where the contour is determined by projecting the 3D model with its associated 6D pose onto the frame. Then, the alignment error between segmentation and projection drives the update of the pose parameters via gradient descent. In a follow-up work [17], the authors extend their method to simultaneously track and reconstruct a 3D object on a mobile phone in real-time. They circumvent GPU rendering by hierarchically raycasting a volumetric representation and speed up pose optimization by exploiting the phone’s inertial sensor data. Tjaden et al. [29] build on the original framework and extend it with a new optimization scheme that employs Gauss-Newton together with a twist representation. Additionally, they handle occlusions in a multi-object tracking scenario, making the whole approach more robust in practice. The typical problem of these methods is their fragile segmentation based on color histograms, which can fail easily without an adaptive appearance model, or when tracking in scenes where the background colors match the objects’ colors. Based on this, [31] explores a boundary term to strengthen contours, whereas [8] improves the segmentation with local appearance models.
When it comes to temporal tracking from depth data, most works are based either on energy minimization of sparse and dense data [14, 3, 20, 19, 32, 25], or on learning, such as the work from Tan et al. [26, 27] and Krull et al. [13]. Among these, the closest to us are the works from Ren et al. [20, 19], which track and simultaneously reconstruct a 3D level-set embedding from depth data, following a color-based segmentation. We move orthogonally by looking at depth as an additional modality towards reliable segmentations and to improve tracking via a novel ICP term in a joint formulation.
3 Methodology
We first introduce the notion of contour tracking in RGB-D images. There we formalize a novel foreground posterior probability composed of both color and cloud data. This is followed by the complete energy formulation over joint contour and cloud alignment. Finally, we explain our further contributions to boost runtime performance via our proposed approximation schemes.
3.1 Tracking via implicit contour embeddings
In the spirit of [18, 29], we want to track a (meshed) 3D model in camera space such that its projected silhouette aligns perfectly with a 2D segmentation in the image with being the image domain. Given a silhouette (i.e. foreground mask) , we can infer a contour to compute a signed distance field (SDF) s.t.
(1) 
where a pixel tells the signed distance to the closest contour point and is the set of background pixels.
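As a minimal sketch of the signed distance field described above, the following brute-force routine computes it for a small binary mask. The sign convention (negative inside the silhouette, positive on the background) and the 4-neighbour contour definition are assumptions made for illustration; a real tracker would use a linear-time distance transform instead.

```python
import math


def signed_distance_field(mask):
    """Brute-force SDF for a small, non-empty binary mask (rows of 0/1).

    Assumed convention (not confirmed by the text): negative inside the
    silhouette, positive on the background; the magnitude is the distance
    to the closest contour pixel.
    """
    h, w = len(mask), len(mask[0])

    def is_contour(y, x):
        # A foreground pixel with at least one 4-neighbour outside the mask.
        if not mask[y][x]:
            return False
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                return True
        return False

    contour = [(y, x) for y in range(h) for x in range(w) if is_contour(y, x)]
    sdf = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d = min(math.hypot(y - cy, x - cx) for cy, cx in contour)
            sdf[y][x] = -d if mask[y][x] else d
    return sdf
```

The quadratic cost is irrelevant for the toy mask used here but illustrates why the dense SDF is one of the costly steps that the later approximation avoids.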
We follow the PWP3D tracker energy formulation [18] in which the pixelwise (posterior) probability of a contour, embedded as , given color image , is defined as
(2) 
The terms are modeling posterior distributions for foreground and background membership based on color, in practice computed from normalized RGB histograms, whereas represents a smoothed version of the Heaviside step function defined on . To get an impression of the involved terms, we refer to Figure 2.
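To illustrate how the smoothed Heaviside blends the foreground and background posteriors per pixel, here is a minimal sketch. The arctangent form of the step, the slope `s = 1.2`, and the convention that the step is close to 1 inside the silhouette are assumptions chosen for illustration, as are all function names.

```python
import math


def smoothed_heaviside(phi, s=1.2):
    """Smooth step in [0, 1]: near 1 for phi << 0 (inside), near 0 outside.

    The arctangent form and the slope s are illustrative assumptions.
    """
    return (-math.atan(s * phi) + math.pi / 2.0) / math.pi


def pixel_posterior(phi, p_fg, p_bg, s=1.2):
    """Blend foreground/background posteriors by the smoothed step."""
    h = smoothed_heaviside(phi, s)
    return h * p_fg + (1.0 - h) * p_bg


def contour_energy(phis, p_fgs, p_bgs, s=1.2):
    """Negative log of the product of pixel-wise posteriors."""
    return -sum(
        math.log(pixel_posterior(phi, pf, pb, s))
        for phi, pf, pb in zip(phis, p_fgs, p_bgs)
    )
```

With this convention, a contour whose interior covers pixels with a high foreground posterior yields a lower energy than one that covers background-like pixels.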
In practice, this posterior works well when foreground and background are clearly different, and starts failing when the colors of parts of the background get close to the color of the target object. To circumvent this problem, we propose to use depth information coming from the RGB-D sensor at the sparse sample points on the foreground of the object as supplementary information.
Let us define a depth map and its cloud map . Furthermore, we conduct a fast depth map inpainting such that we remove all unknown values in both and . Our goal is now twofold: we want to make the posterior image more robust by including cloud information into the probability estimation, and we want to extend the tracking energy to the new data.
3.2 Pixelwise color/cloud posterior
In practice, color histograms are very error-prone and fail quickly for textured/glossy objects and colorful backgrounds, even with adaptive histograms during tracking. We therefore propose a new robust pixelwise posterior to be used for contour alignment in Eq. 2 when additional depth data is provided. The notion we bring forward is that color posteriors alone are misleading and should be reweighted with their spatial proximity to the model.
Given a model with pose , we infer silhouette region and background and now define their probabilities not only based on a given pixel color but also an associated cloud point . We start from estimating the probability of a pose and its silhouette, provided color and cloud data, and define as a binary foreground/background variable. For tractability, we assume that a pose and its silhouette are independent, given and :
Assuming that all pixels are independent, that there is no correlation between the color of a pixel and its cloud point, and that is uniform, we reach (we refer the reader to the supplement for the full derivation):
(3)  
(4) 
While are usually computed from color histograms, it is not directly clear how to compute since it assumes inference for 3D data from an image mask while the term is infeasible in general. We thus drop both terms (i.e. set both to uniform) and finally define
(5) 
The weighting term , which gives the likelihood of a cloud point to be on the model, can be computed in multiple ways. Instead of simply taking the distance to the model centroid, we want a more precise measure that gives back the distance to the closest model point. Since even logarithmic nearest-neighbor lookups would be costly here, we use an idea first presented in [7]. One can precompute a distance transform in a volume around the model to facilitate a constant-time nearest-neighbor lookup function, , and we exploit this approach by bringing each scene cloud point into the local object frame and efficiently calculating a pixelwise weighting on the image plane with a Gaussian kernel:
(6) 
Here, steers how much deviation we allow a point to have from a perfect alignment since we want to deal with pose inaccuracies as well as depth noise.
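The lookup-based weighting can be sketched as follows. This is a minimal illustration under stated assumptions: the distance volume is built by brute force and stored in a dictionary, the pose maps object to camera coordinates as `x_cam = R x_obj + t`, and the kernel is taken as `exp(-d^2 / (2 sigma^2))`; the exact normalisation of the Gaussian in Eq. 6 and all names here are hypothetical.

```python
import math


def build_distance_volume(model_points, origin, voxel, dims):
    """Offline step: distance from each voxel centre to the closest model point."""
    vol = {}
    for i in range(dims[0]):
        for j in range(dims[1]):
            for k in range(dims[2]):
                centre = tuple(
                    origin[a] + (idx + 0.5) * voxel
                    for a, idx in enumerate((i, j, k))
                )
                vol[(i, j, k)] = min(math.dist(centre, p) for p in model_points)
    return vol


def to_object_frame(R, t, p_cam):
    """Invert x_cam = R x_obj + t, i.e. return R^T (p_cam - t)."""
    diff = [p_cam[r] - t[r] for r in range(3)]
    return [sum(R[r][c] * diff[r] for r in range(3)) for c in range(3)]


def cloud_weight(p_obj, vol, origin, voxel, sigma):
    """Constant-time lookup followed by a Gaussian kernel on the distance."""
    idx = tuple(math.floor((p_obj[a] - origin[a]) / voxel) for a in range(3))
    d = vol.get(idx, math.inf)  # points outside the volume get weight 0
    return math.exp(-(d * d) / (2.0 * sigma * sigma))
```

Following the reweighting idea stated earlier, the resulting weight would then scale the color-based foreground posterior of the corresponding pixel.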
Figure 3 shows the color posterior at work in combination with the cloud-based weighting term. While the former gives a segmentation based on appearance alone, the latter takes complementary spatial distances into account, rendering contour-based tracking more robust.
3.3 Joint contour and cloud tracking
We introduce the notion of a combined tracking approach where 2D contour points and 3D cloud points are jointly driving the pose update. In essence, we seek a weighted energy of the form
(7) 
where balances both partial energies since they can deviate in the number of samples as well as in numerical scale. We note that [15] formulates a similar optimization problem.
3.3.1 Contour energy
Assuming pixelwise independence and taking the negative logarithm of Eq. 2, we get a contour energy functional
(8) 
In order to optimize the energy with respect to a change in model pose, we employ a Gauss-Newton scheme over twist coordinates, similarly to Tjaden et al. [29]. We define a twist vector that provides a minimal representation for our sought transformation and its Lie algebra twist as well as its exponential mapping to the Lie group :
(9)
We abuse notation s.t. expresses the transformation of applied to a 3D point . Assuming only infinitesimal change in transformation, we derive the energy with respect to a point undergoing a screw motion (for brevity, the full derivation is in the supplement) as
(10) 
A visualization of some terms can be seen in Figure 2. While and can be written in analytical form, resolves essentially to a smoothed Dirac delta whereas can be implemented via simple central differences. In total, we obtain one Jacobian per pixel and solve a least-squares problem
(11) 
via Cholesky decomposition. Given the model pose at time , we update via the exponential mapping
(12) 
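The pose update via the exponential map can be sketched in a few lines. The ordering of the twist as (rotation, translation) and the left-multiplication of the increment onto the current pose are assumptions made for this illustration; the closed form itself is the standard Rodrigues-type expression for rigid motions.

```python
import math


def hat(w):
    """Skew-symmetric matrix of a 3-vector."""
    return [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]


def exp_twist(xi):
    """Map a twist (wx, wy, wz, vx, vy, vz) to a 4x4 rigid transform."""
    w, v = list(xi[:3]), list(xi[3:])
    theta = math.sqrt(sum(c * c for c in w))
    W = hat(w)
    W2 = [[sum(W[i][k] * W[k][j] for k in range(3)) for j in range(3)]
          for i in range(3)]
    if theta < 1e-9:
        # Series limits for vanishing rotation.
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    T = [[0.0] * 4 for _ in range(4)]
    for i in range(3):
        for j in range(3):
            eye = 1.0 if i == j else 0.0
            T[i][j] = eye + a * W[i][j] + b * W2[i][j]          # rotation
            T[i][3] += (eye + b * W[i][j] + c * W2[i][j]) * v[j]  # translation
    T[3][3] = 1.0
    return T


def update_pose(T, xi):
    """Assumed update rule: left-multiply the increment onto the pose."""
    E = exp_twist(xi)
    return [[sum(E[i][k] * T[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]
```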
3.3.2 ICP Energy
In terms of ICP, a point-to-plane error has been shown to provide better and faster convergence than a point-to-point metric. It assumes alignment of source points (here from a model view) to points and normals at the destination (here the scene). Normals in camera space can be approximated from depth images [9] but are usually noisy and take time to compute. We thus propose a novel plane-to-point error where the normals are coming from the source point set and have been computed beforehand for each viewpoint. This ensures a fast runtime and perfect data alignment since tangent planes coincide at the optimum.
Given the current pose and closest viewpoint with local interior points , we transform to and project each to get the corresponding scene point . Since we also have a local that we bring into the scene, , we want to retrieve minimizing
(13) 
The difference to the established point-to-plane error is solving for an additional rotation of the source normal . Note that only the rotational part of acts on and we thus omit the translational generators of the Lie algebra. Differentiating with respect to (the derivation can be found in the supplementary material), we get a Jacobian and a residual for each correspondence
(14) 
(15) 
and construct a normal system to get a twist of the form
(16) 
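One plausible reading of the plane-to-point term is sketched below. It assumes the residual of a correspondence is the source normal dotted with the offset between the (already transformed) source point and the scene point, and that both the source point and its normal move with the pose increment. The Jacobian is derived here from that assumed residual by linearising around a zero twist with (rotation, translation) ordering; it is not copied from Eqs. 14 and 15.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def plane_to_point_residual(s, n, d):
    """Assumed residual: source normal n dotted with (source s - scene d)."""
    return dot(n, [s[i] - d[i] for i in range(3)])


def plane_to_point_jacobian(n, d):
    """First-order derivative of the residual w.r.t. a twist (w, v) at zero.

    With s -> s + w x s + v and n -> n + w x n, the terms involving s cancel
    and the row becomes [d x n, n].
    """
    return cross(d, n) + list(n)
```

Note that, under this reading, the rotational part of the Jacobian depends on the scene point rather than the source point, which is the mirror image of the usual point-to-plane linearisation.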
Altogether, we can now combine Eqs. 8 and 13 to formulate the desired energy from Eq. 7 as a joint contour and plane-to-point alignment. Following up, we build a normal system that contains both the ray-wise contour Jacobians from 2D image data and correspondence-wise ICP Jacobians from 3D cloud data:
(17) 
Solving the above system yields a twist with which the current pose is updated. The advantage of such a formulation is that we bring entities from different optimization problems into a common framework: while the color pixels minimize a projective error, the cloud points do so with a geometrical error. These complementary cues can therefore compensate for each other if a segmentation is partially wrong or if some depth values are noisy.
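A joint normal system of this kind can be accumulated and solved as in the following sketch. It assumes the balancing weight multiplies the ICP part of the energy, that each term contributes a 1x6 Jacobian row and a scalar residual, and that the accumulated matrix is symmetric positive definite so that a Cholesky factorisation exists; the function names are hypothetical.

```python
import math


def cholesky_solve(A, b):
    """Solve A x = b for a symmetric positive definite A."""
    n = len(b)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # backward substitution: L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def joint_update(contour_terms, icp_terms, lam):
    """Twist minimising sum (r + J x)^2 over contour terms plus lam times
    the same sum over ICP terms; each term is a pair (J, r)."""
    A = [[0.0] * 6 for _ in range(6)]
    b = [0.0] * 6
    for weight, terms in ((1.0, contour_terms), (lam, icp_terms)):
        for J, r in terms:
            for i in range(6):
                b[i] -= weight * J[i] * r
                for j in range(6):
                    A[i][j] += weight * J[i] * J[j]
    return cholesky_solve(A, b)
```

Because both modalities only add to the same 6x6 matrix and 6-vector, the cost of the solve is independent of the number of samples.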
3.4 Approximating for realtime tracking
Computing the SDF from Eq. 1 already involves three costly steps: a silhouette rendering of the current model pose, an extraction of the contour and, lastly, a subsequent distance transform embedding . While [29] perform GPU rendering and couple computation of the SDF and its gradient in the same pass to be faster, [17] perform hierarchical raytracing on the CPU and extract the contour via Scharr operators. We make two key observations:

Only the actual contour points are required

Neighboring points provide superfluous information because of similar curvature
We thus propose a cheap yet very effective approximation of the model render space that avoids both online rendering and contour extraction. In an offline stage, we equidistantly sample viewpoints on a unit sphere around the object model, render from each and extract the 3D contour points to store view-dependent sparse 3D sampling sets in local object space (see Fig. 4). Since we will utilize these points in 3D space, we neither need to sample in scale nor for different in-plane rotations. Finally, we also store for each contour point its 2D gradient orientation and sample a set of interior surface points with their normals (see Fig. 5).
In a naive approach, all involved terms from Eq. 8 would be computed densely, i.e. , which is prohibitively costly for real-time scenarios. The related work evaluates the energy only in a narrow band around the contour since the residuals decay quickly when leaving the interface. We therefore propose to compute Eq. 10 in a narrow band along a sparse set of selected contour points where we compute along rays. Each projected contour point shoots a positive and negative ray perpendicularly to the contour, i.e. along its normal. Building on that, we introduce the idea of ray integration for 3D contour points such that we do not create pixel-wise but ray-wise Jacobians, which leads to a smaller reduction step and a better conditioning of the normal system in Eq. 11 than the approach of [17].
To formalize, we have a model pose during tracking and avoid rendering by computing the camera position in object space . We normalize to unit length and find the closest viewpoint quickly via dot products:
(18) 
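The viewpoint lookup can be sketched as follows. It assumes the pose maps object to camera coordinates as `x_cam = R x_obj + t`, so that the camera centre in object space is `-R^T t`, and that the stored viewpoints are unit vectors; the function names are hypothetical.

```python
import math


def camera_in_object_space(R, t):
    """Camera centre in object coordinates for x_cam = R x_obj + t."""
    return [-sum(R[r][c] * t[r] for r in range(3)) for c in range(3)]


def closest_viewpoint(R, t, viewpoints):
    """Index of the pre-rendered unit viewpoint with the largest dot product."""
    c = camera_in_object_space(R, t)
    norm = math.sqrt(sum(v * v for v in c))
    c = [v / norm for v in c]
    return max(
        range(len(viewpoints)),
        key=lambda i: sum(a * b for a, b in zip(viewpoints[i], c)),
    )
```

Since all candidates have unit length, maximising the dot product is equivalent to minimising the angle to the current camera direction.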
Each local 3D sample point of the contour from is then transformed and projected to a 2D contour sample point which is then used to shoot rays into the object interior and into the opposite direction.
To get the orientation of each ray, we cannot rely anymore on the value during pre-rendering since the current model pose might have an in-plane rotation not accounted for. Given a contour point with 2D rotation angle during pre-rendering, we could embed it into 3D space via and later multiply it with the current model rotation . Although this works in practice, the projection of onto the image plane can be off at times. We thus propose a new approximation of the in-plane rotation where we seek to decompose s.t. one part describes a general rotation around the object center in a canonical frame and the other a rotation around the view direction of the camera (i.e. in-plane) . Although ill-posed in general, we exploit our knowledge about the closest viewpoint by assuming and propose to approximate a rotation on the xy-plane via
(19) 
We then extract the angle via the first element. With larger viewpoint deviation , this approximation worsens but our sphere sampling is dense enough to alleviate this in practice. We reorient each contour gradient and shoot rays to compute the residuals and from Eq. 8 (see Fig. 5 to compare the orientations and the bottom row in Figure 2 for the SDF rays).
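A minimal sketch of such an in-plane angle estimate is given below. It assumes the current rotation factors as an in-plane rotation about the camera z-axis applied after the rotation of the closest pre-rendered viewpoint, so that the in-plane part is approximated by `R * R_view^T`. As a deviation from the text, the angle is read with `atan2` from the first column rather than from the first element alone, which keeps the sign of the angle.

```python
import math


def matmul3(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def inplane_angle(R, R_view):
    """Angle of the approximate in-plane rotation M = R * R_view^T."""
    M = [[sum(R[i][k] * R_view[j][k] for k in range(3)) for j in range(3)]
         for i in range(3)]
    return math.atan2(M[1][0], M[0][0])
```

When the current pose coincides with a stored viewpoint up to an in-plane rotation, the estimate is exact; otherwise it degrades with the angular distance to that viewpoint, as described above.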
The final missing building block is the derivative of the SDF which cannot be computed numerically since we are missing dense information. We thus compute it geometrically, similar to [17]. Whereas their computation is exact when assuming local planarity by projections onto the principal ray, our approach is faster while incurring a small error which is negligible in practice. Given a ray from contour point we compute the horizontal derivative at as central difference
(20) 
The vertical derivative is computed analogously. Like the related work, we perform all computations on three pyramid levels in a coarse-to-fine manner and shoot the rays in a band of 8 steps on each level. Since we shoot two rays per contour point, our resulting normal system holds two ray Jacobians per point.
3.5 Implementation details
Our method runs in C++ on a single core of an i7-5820K @ 3.3GHz. In total, we render a model from 642 equidistant views, amounting to around 8 degrees in angular difference between two viewpoints. To compute the histograms we avoid rendering and instead fetch the colors at the projected interior points for the foreground histogram. For the background histogram, we compute the rectangular 2D projection of the model’s 3D bounding box and take the pixels outside of it. We employ 1D lookup tables for both and its derivative to speed up computation. Lastly, if we find a projected transformed point to be occluded, i.e. , we discard this point for all computations.
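The text does not state how the 642 equidistant views are generated. One common construction that yields exactly this number is an icosahedron subdivided three times with all vertices pushed onto the unit sphere (12, 42, 162, 642 vertices); the sketch below uses it purely as an assumption.

```python
import math


def icosphere(subdivisions=3):
    """Unit-sphere viewpoints from a recursively subdivided icosahedron."""
    g = (1.0 + math.sqrt(5.0)) / 2.0  # golden ratio

    def unit(v):
        n = math.sqrt(sum(c * c for c in v))
        return tuple(c / n for c in v)

    verts = [unit(v) for v in (
        (-1, g, 0), (1, g, 0), (-1, -g, 0), (1, -g, 0),
        (0, -1, g), (0, 1, g), (0, -1, -g), (0, 1, -g),
        (g, 0, -1), (g, 0, 1), (-g, 0, -1), (-g, 0, 1))]
    faces = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
             (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
             (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
             (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    for _ in range(subdivisions):
        cache = {}  # edge -> index of its midpoint vertex

        def midpoint(a, b):
            key = (min(a, b), max(a, b))
            if key not in cache:
                mid = tuple((verts[a][i] + verts[b][i]) / 2.0 for i in range(3))
                verts.append(unit(mid))
                cache[key] = len(verts) - 1
            return cache[key]

        new_faces = []
        for a, b, c in faces:
            ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
            new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
        faces = new_faces
    return verts
```

Each returned vertex can serve as a camera position looking at the object centre for the offline rendering stage.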
4 Evaluation
To provide quantitative numbers and to self-evaluate our method on noise-free data, we run the first set of experiments on the synthetic RGB-D dataset of Choi and Christensen [3]. It provides four sequences of 1000 frames where each covers an object around a given trajectory. Later, we run convergence experiments on the LineMOD dataset [9] and evaluate against Tan et al. on two of their sequences.
4.1 Balancing the tracking energy with
To understand the balancing between contour and interior points, we analyze the influence of a changing . It should both compensate for a different number of sampling points and numerical scale. We fix the sample points to 50 for both modalities to focus solely on the scale difference from the Jacobians. While the ICP values are metric, ranging around , the values from the contour Jacobians are in image coordinates and can therefore be in the thousands. We chose two sequences, namely ‘kinect_box’ and ‘tide’, and varied . All four sequences are impossible to track via contour alone () since the similarity between foreground and background is too large. On the other hand, relying on a plane-to-point energy alone () leads to planar drifting for the ‘kinect_box’. We therefore found to be a good compromise (see Figure 6).
4.2 Varying the number of sampling points
With a fixed , we now look at the behavior when we change the number of sample points. We again chose ‘tide’ since it has rich color and geometry to track against. As can be seen in Figure 7, the error decreases steadily until 30 points, where the translational error plateaus while the rotational error decays further, plateauing around 80-90 points. We were surprised to see that a rather small sampling set of 10 contour/interior points already leads to proper energy solutions, enabling successful tracking on the sequence.
4.3 Comparison to related work
We ran our method with and points both on the contour and the interior. Since we wanted to measure the performance of the novel energy alignment with and without the additional cloud weighting, we repeated the experiments for both scenarios. In line with the other methods, we evaluate by computing the RMSE on each translational axis as well as each rotational axis. As can be seen from Table 1, we outperform the other methods greatly, sometimes even by up to one order of magnitude. This result is not really surprising, since we are the only method that does a direct, projective energy minimization. While both C&C and Krull use a particle filter approach that costs them more than 100ms, Tan evaluates a Random Forest based on depth differences. Tan and C&C employ depth information only whereas Krull uses RGB-D data like us.
If we compare our runtimes, we are very close to Tan. While they constantly need around 1.5ms, we need less than 3ms on average to compute the full update. If we compute the added cloud weighting, it takes us another 6ms but yields the lowest reported error so far on this dataset. Note that both Tan and Krull require a training stage to build their regression structures whereas our method only needs to render 642 views and extract sample information. This takes about 5 seconds in total and requires roughly 10MB per model. Additionally, if we compare to the GPU-enabled dense implementation of Tjaden et al. [29], we are roughly four times faster on a single CPU core.
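Table 1: RMSE per translational and rotational axis and runtime (ms) on the synthetic dataset of Choi and Christensen [3]. A and B denote our method without and with the cloud weighting.

| | PCL | C&C [3] | Krull [13] | Tan [27] | A | B |
|---|---|---|---|---|---|---|
| (a) Kinect Box | | | | | | |
| t_x | 43.99 | 1.84 | 0.8 | 1.54 | 1.2 | 0.76 |
| t_y | 42.51 | 2.23 | 1.67 | 1.90 | 1.16 | 1.09 |
| t_z | 55.89 | 1.36 | 0.79 | 0.34 | 0.30 | 0.38 |
| α | 7.62 | 6.41 | 1.11 | 0.42 | 0.14 | 0.17 |
| β | 1.87 | 0.76 | 0.55 | 0.22 | 0.23 | 0.18 |
| γ | 8.31 | 6.32 | 1.04 | 0.68 | 0.22 | 0.20 |
| ms | 4539 | 166 | 143 | 1.5 | 2.70 | 8.10 |
| (b) Milk | | | | | | |
| t_x | 13.38 | 0.93 | 0.51 | 1.23 | 0.91 | 0.64 |
| t_y | 31.45 | 1.94 | 1.27 | 0.74 | 0.71 | 0.59 |
| t_z | 26.09 | 1.09 | 0.62 | 0.24 | 0.26 | 0.24 |
| α | 59.37 | 3.83 | 2.19 | 0.50 | 0.44 | 0.41 |
| β | 19.58 | 1.41 | 1.44 | 0.28 | 0.31 | 0.29 |
| γ | 75.03 | 3.26 | 1.90 | 0.46 | 0.43 | 0.42 |
| ms | 2205 | 134 | 135 | 1.5 | 2.72 | 8.54 |
| (c) Orange Juice | | | | | | |
| t_x | 2.53 | 0.96 | 0.52 | 1.10 | 0.59 | 0.50 |
| t_y | 2.20 | 1.44 | 0.74 | 0.94 | 0.64 | 0.69 |
| t_z | 1.91 | 1.17 | 0.63 | 0.18 | 0.18 | 0.17 |
| α | 85.81 | 1.32 | 1.28 | 0.35 | 0.12 | 0.12 |
| β | 42.12 | 0.75 | 1.08 | 0.24 | 0.22 | 0.20 |
| γ | 46.37 | 1.39 | 1.20 | 0.37 | 0.18 | 0.19 |
| ms | 1637 | 117 | 129 | 1.5 | 2.79 | 8.79 |
| (d) Tide | | | | | | |
| t_x | 1.46 | 0.83 | 0.69 | 0.73 | 0.36 | 0.34 |
| t_y | 2.25 | 1.37 | 0.81 | 0.56 | 0.51 | 0.49 |
| t_z | 0.92 | 1.20 | 0.81 | 0.24 | 0.18 | 0.18 |
| α | 5.15 | 1.78 | 2.10 | 0.31 | 0.20 | 0.15 |
| β | 2.13 | 1.09 | 1.38 | 0.25 | 0.43 | 0.39 |
| γ | 2.98 | 1.13 | 1.27 | 0.34 | 0.39 | 0.37 |
| ms | 2762 | 111 | 116 | 1.5 | 2.71 | 9.42 |
| Mean | | | | | | |
| Tra | 18.72 | 1.36 | 0.82 | 0.81 | 0.58 | 0.51 |
| Rot | 29.70 | 2.45 | 1.38 | 0.37 | 0.28 | 0.26 |
| ms | 2786 | 132 | 131 | 1.5 | 2.73 | 8.71 |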

4.4 Convergence properties
Since our proposed joint energy has not been applied in this manner before, we were curious about the general convergence behavior. To this end, we used the real-life LineMOD dataset [10]. Although designed for object detection, it has ground truth annotations for 15 textureless objects and we thus mimic a tracking scenario by perturbing the ground truth pose and ‘tracking back’ to the correct pose. More precisely, we create 1000 perturbations per frame by randomly sampling a perturbation angle in the range [,] separately for each axis and a random translational offset in the range [,] where is of the model’s diameter. This yields more than 1 million runs per sequence and configuration, giving us a rigorous quantitative convergence analysis which we present in Figure 8 on 3 sequences (the supplement contains the figures for all sequences) as histograms over the final rotational error. We also plot the mean LineMOD score for each . For this, the model cloud is transformed once with the ground truth and once with the retrieved pose and if the average Euclidean error between the two is smaller than of the diameter, we count it as positive. Our optimization is iterative and coarse-to-fine on three levels and we thus computed the above score for a different set of iterations. For example, 221 indicates 2 iterations at the coarsest scale, 2 at the middle and 1 at the finest.
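The perturbation protocol and the acceptance test can be sketched as follows. The perturbation ranges and the diameter fraction are left as parameters because their values are not recoverable from the text; poses are assumed to be `(R, t)` pairs mapping model points into the camera frame, and all names are hypothetical.

```python
import math
import random


def apply_pose(R, t, p):
    """Transform a 3D point by rotation R and translation t."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


def linemod_score(model_points, pose_gt, pose_est, diameter, frac):
    """True if the mean distance between the model cloud under both poses
    is below frac * diameter."""
    total = 0.0
    for p in model_points:
        total += math.dist(apply_pose(*pose_gt, p), apply_pose(*pose_est, p))
    return total / len(model_points) < frac * diameter


def sample_perturbation(max_angle_deg, max_offset, rng):
    """Draw one angle per axis and one translational offset per axis,
    each uniformly from a symmetric range."""
    angles = [rng.uniform(-max_angle_deg, max_angle_deg) for _ in range(3)]
    offset = [rng.uniform(-max_offset, max_offset) for _ in range(3)]
    return angles, offset
```

A convergence run would perturb the ground-truth pose with such a sample, run the tracker for a fixed iteration schedule, and record whether the final pose passes the score.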
During tracking, a typical change in pose rarely exceeds on each axis and for this scenario, we can report near-perfect results. Nonetheless, we fare surprisingly well for more difficult pose deviations and degrade gracefully. From the LineMOD scores we see that one iteration on the finest level is not enough to recover stronger perturbations. For very high , the additional iterations on the coarser scales can make a difference of up to 10%, which is mainly explained by the SDF rays capturing larger spatial distances.
4.5 Real-data comparison to state-of-the-art
We thank the authors from Tan et al. for providing two sequences together with ground truth annotation such that we could evaluate our algorithm in direct comparison to their method. In contrast to us, their method has a learned occlusion handling built-in. Both sequences feature a rotating table with a center object to track, undergoing many levels of occlusion. As can be seen from Figure 9, we outperform their approach, especially on the second sequence.
4.6 Failure cases
The weakest link in the method is the posterior computation since the whole contour energy depends on it. In the case of blur or a sudden change of colors (e.g. illumination) the posterior is misled. Furthermore, with our approximate SDF we sometimes fail for small or non-convex contours where the inner rays overshoot the interior.
5 Conclusion
We have demonstrated how RGB and depth can be utilized in a joint fashion for the goal of accurate and efficient 6D pose tracking. The proposed algorithm relies on a novel optimization scheme that is general enough to be individually applied on either the depth or the RGB modality, while being able to fuse them in a principled way when both are available. Our system runs in real-time using a single CPU core, and can track around 10 objects at 30Hz, which is a realistic upper bound on what can visually fit into one VGA image. At the same time, it is able to report state-of-the-art accuracy and inherent robustness towards occlusion.
Acknowledgments
The authors would like to thank Henning Tjaden for useful implementation remarks and Toyota Motor Corporation for supporting and funding this work.
References
[1] (2008) Robust Real-Time Visual Tracking using Pixel-Wise Posteriors. In ECCV.
[2] (2010) Combined Region and Motion-Based 3D Tracking of Rigid and Articulated Objects. TPAMI.
[3] (2013) RGB-D Object Tracking: A Particle Filter Approach on GPU. In IROS.
[4] (2007) A review of statistical approaches to level set segmentation: Integrating color, texture, motion and shape. IJCV.
[5] (2010) A Geometric Approach to Joint 2D Region-Based Segmentation and 3D Pose Estimation Using a 3D Shape Prior. SIAM Journal on Imaging Sciences.
[6] (2002) Real-time visual tracking of complex structures. TPAMI.
[7] (2001) Robust registration of 2D and 3D point sets. In BMVC.
[8] (2016) 2D-3D Pose Estimation of Heterogeneous Objects Using a Region Based Approach. IJCV.
[9] (2012) Gradient Response Maps for Real-Time Detection of Textureless Objects. TPAMI.
[10] (2012) Model Based Training, Detection and Pose Estimation of Texture-Less 3D Objects in Heavily Cluttered Scenes. In ACCV.
[11] (2012) Online learning of linear predictors for real-time tracking. In ECCV.
[12] (2015) Hashmod: A Hashing Method for Scalable 3D Object Detection. In BMVC.
[13] (2014) 6-DOF Model Based Tracking via Object Coordinate Regression. In ACCV.
[14] (2011) KinectFusion: Real-time dense surface mapping and tracking. In ISMAR.
[15] (2011) Textureless object tracking with online training using an RGB-D camera. In ISMAR.
[16] (2008) Multiple 3D Object tracking for augmented reality. In ISMAR.
[17] (2015) Real-Time 3D Tracking and Reconstruction on Mobile Phones. TVCG.
[18] (2012) PWP3D: Real-Time Segmentation and Tracking of 3D Objects. IJCV.
[19] (2014) 3D Tracking of Multiple Objects with Identical Appearance using RGB-D Input. In 3DV.
[20] (2013) STAR3D: Simultaneous tracking and reconstruction of 3D objects using RGB-D data. In ICCV.
[21] (2006) A comparison of shape matching methods for contour based pose estimation. LNCS.
[22] (2007) Region-Based Pose Tracking. In IbPRIA.
[23] (2012) Region-based pose tracking with occlusions using 3D models. MVA.
[24] (2014) Optimal local searching for fast and robust textureless 3D object tracking in highly cluttered backgrounds. In TVCG.
[25] (2016) SDF-2-SDF: Highly Accurate 3D Object Reconstruction. ECCV.
[26] (2014) Multi-forest tracker: A Chameleon in tracking. In CVPR.
[27] (2015) A Versatile Learning-based 3D Temporal Tracker: Scalable, Robust, Online. In ICCV.
[28] (2009) Model-based 3D Object Tracking with Online Texture Update. In MVA.
[29] (2016) Real-Time Monocular Segmentation and Pose Tracking of Multiple Objects. In ECCV.
[30] (2004) Stable Real-Time 3D Tracking Using Online and Offline Information. TPAMI.
[31] (2014) 3D object tracking via boundary constrained region-based model. In ICIP.
[32] (2015) Depth Camera Tracking with Contour Cues. In CVPR.
6 Convergence on the LineMOD dataset
We run the experiments on all non-symmetric objects of the LineMOD dataset since the introduced LineMOD measure for symmetric objects is very misleading and allows obviously wrong poses to be regarded as correct. Since we are interested in very accurate poses, these results would not have provided additional insight. As in the main paper, the degradation for larger deviations is increasing but we do not see a sharp sudden decline. Again, additional iterations on multiple levels drastically improve the general alignment under all perturbations.
7 Optimization of the Contour Energy
Starting from the probability of a contour when given an observed image
(21) 
that we seek to maximize, one can instead minimize the negative log which breaks down to a sum of pixelwise terms:
(22) 
The derivative with respect to a change in pose
(23) 
can be then expressed via the following terms. Firstly, the smoothed Heaviside and its derivative as a smoothed Dirac delta
(24) 
with being the grade of applied smoothing to the contour embedding in our implementation. Provided the 3D point to the projected 2D point and intrinsics , we write the rest of the derivatives as
(25) 
Together with the other explanations from the paper, we can now compute the Jacobian and update the pose as explained via Cholesky decomposition and the exponential map.
8 Optimization of the Plane-to-Point Energy
Given source points and source normals , we seek an alignment to destination points such that
(26) 
Differentiating with respect to for a given correspondence yields
(27) 
(28) 
Starting from here, we want to employ a Gauss-Newton scheme for the optimization. We thus seek an increment around such that we minimize the error, i.e. we conduct a Taylor expansion around zero s.t.
(29)
Following the typical approximation scheme, we disregard the higher-order terms. Defining the residual and a Jacobian , we arrive at
(30)
To minimize , we can now differentiate with respect to and set the result to zero to find the best update :
(31) 
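For reference, the generic Gauss-Newton step described above can be written out as follows. The symbols are notation assumed for this sketch: $r_i$ is the scalar residual of correspondence $i$, $J_i$ its $1 \times 6$ Jacobian row, and $\Delta\xi$ the sought twist increment.

```latex
\begin{align}
E(\Delta\xi) &\approx \sum_i \left( r_i + J_i \,\Delta\xi \right)^2 \\
\frac{\partial E}{\partial \Delta\xi}
  &= 2 \sum_i J_i^\top \left( r_i + J_i \,\Delta\xi \right) = 0 \\
\Rightarrow \quad
\Big( \sum_i J_i^\top J_i \Big)\, \Delta\xi &= - \sum_i J_i^\top r_i
\end{align}
```

The last line is a symmetric $6 \times 6$ linear system, which is why it can be merged with the contour terms into one normal system and solved by Cholesky decomposition.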
Finally, to fuse it seamlessly into the contour optimization, we negate (or alternatively ) and retrieve the final normal system for the joint energy from the paper.
9 Derivation of the Posterior
What we essentially want to formulate is the probability of a model pose and its projected silhouette , given color image and cloud data ,
(32) 
This expression is difficult to compute in general since it involves 2D and 3D entities as well as a 6D pose. Instead, we make the first step towards tractability by assuming that each pixel is independent. We thus rephrase as a binary variable that signifies whether a certain pixel is foreground or background. Dealing now with pixelwise colors and cloud points , we also assume that a pose and its projected silhouette are independent entities, given and :
(33)
We now go into the derivation of those two terms. Applying Bayes’ rule, we get
(34) 
and we now assume further that colors and cloud points are independent, given the foreground or pose model. From here, we marginalize over both instances of as well as the model pose space:
(35) 
We compute and from color histograms whereas and . While the marginalization over foreground and background is straightforward, it is intractable for the model pose space. As mentioned in the paper, we assume both and to be uniform. Furthermore, since the integration over all valid of our cloud term is constant, we reduce ourselves to a proportionate measure.