Tracking the pose of a rigid object in monocular videos is a fundamental challenge in computer vision with numerous applications in mixed reality, robotics, medical navigation and human-computer interaction. Given an image sequence, the aim is to robustly and accurately determine the translation and rotation of a known rigid object relative to the camera between frames. While this problem has been intensively studied, accurate real-time 6DOF pose estimation that is robust to background clutter, partial occlusions, motion blur or defocus remains an important open problem (see Figure 1).
1.1 Real-Time Monocular Object Tracking
Especially in the constantly growing fields of mixed reality and robotics, object pose estimation is typically only one of many complex tasks that must all be computed simultaneously in real-time, often on battery-powered hardware. Low runtime and low power consumption are therefore crucial for dedicated solutions to be practical in such scenarios. In particular, power consumption can be reduced by using as few sensors as possible, e.g. only a single camera. As for most other computer vision problems, such monocular approaches are also usually the most convenient compared to e.g. multi-camera stereo systems, because they keep calibration requirements at a minimum and suffer least from visibility issues.
In the past, several different strategies have been proposed for the problem of monocular object pose tracking. One popular approach is to detect intensity gradient-based features such as corners or edges in the images and then perform pose estimation by matching these against a suitable 3D model representation (see e.g. [18, 33, 26, 8, 34, 13]). Since the input image is directly reduced to a sparse set of features, a major advantage of such approaches is that they can typically be computed in real-time even on low-powered hardware. The main drawback, especially of point feature-based methods, is however that they require the objects to be well-textured in order to show sufficient local intensity variation in the images from all perspectives. In man-made environments this usually significantly limits the variety of suitable objects to those with text or image graphics printed on them.
For weakly-textured or textureless objects, edge features, describing either their contour or strong intensity gradients of angular structures inside the projected object region, have for many years been shown to be a more suitable alternative. Methods relying on image edges are, however, prone to fail in cluttered scenes that frequently show strong gradients in the background. This potentially causes the pose estimation to end up in local minima (see also e.g. ). Feature-based methods furthermore struggle with motion blur, low lighting conditions and increasing distance to the camera, all of which generally cause the features to appear less distinct in the images.
More recently, so-called region-based methods have gained increasing popularity (see e.g. [25, 5, 2, 27, 21, 7, 32]) since they are potentially suitable for a large variety of objects in complex scenarios regardless of local intensity gradients, i.e. their texture. These approaches assume differing image statistics between the object and the background region. Based on a suitable statistical appearance model as well as a 3D shape prior (i.e. a 3D model of the object), pose estimation essentially works by aligning two silhouettes. One is the object's silhouette extracted from the current image using the segmentation model, while the other is rendered synthetically from the shape prior parametrized by the sought pose. The discrepancy between these two shapes is then minimized by changing the pose parameters used for the synthetic projection. In return, given the object's pose in the current frame, the rendered silhouette provides an accurate pixel-wise segmentation mask that is typically used for updating the foreground and background statistics of the appearance model in order to dynamically adapt to scene changes. Knowing the pose in the first frame of an image sequence therefore allows the statistical model to be initialized. Pose tracking is then performed recursively in an interleaved manner by first estimating the pose based on that in the previous frame and then updating the segmentation model using the mask information in the current frame.
For the sake of completeness we also want to mention that ever since so-called RGB-D cameras (also known as depth sensors) became available as consumer hardware a couple of years ago, another major category of pose tracking algorithms has emerged that relies on these devices (see e.g. [3, 23, 14, 12, 24, 29]). These sensors potentially measure the per-pixel distance to the camera in real-time by combining an infrared light emitter, which actively projects light onto the scene, with a monochrome camera and often an RGB camera in a rigid stereo setup. Due to the additional depth modality such methods commonly outperform those only based on monocular RGB image data. However, due to the active lighting strategy they only operate properly within about 10 meters proximity to the device, generally struggle in the presence of sunlight and shiny surfaces, and have a higher power consumption than a regular camera. We therefore do not include methods requiring such sensors as related works since we do not consider them sufficiently comparable to the monocular setting.
1.2 Related Work
When only using a single regular RGB camera, to the best of our knowledge region-based approaches relying on statistical level-set segmentation  currently achieve state of the art performance for the task of 6DOF object pose tracking. For this reason and due to the large amount of literature on other methods in this domain, here we strictly focus on work that is directly related to our proposed region-based approach. By also presenting a newly constructed complex dataset, we address a gap in the literature regarding current publicly available datasets for monocular 6DOF pose tracking of 3D objects. Thus the second part of this section gives a comprehensive overview of related datasets and their shortcomings compared to the one we present in this work.
Region-based Pose Tracking Methods – Early region-based pose tracking methods were not real-time capable [25, 2, 27] but already showed the vast potential of the general strategy, by presenting promisingly robust results in many complex scenarios. In these works image segmentation was based on level-sets together with pixel-wise likelihoods used to explicitly extract the object’s contour in each camera frame. Here pose estimation was based on an iterative closest points (ICP) approach by solving a linear system, set up from 2D-to-3D point correspondences between the extracted contour and the 3D model. These correspondences are re-established after each iteration in the 2D image plane to the evolving synthetic contour projection.
In  the authors presented PWP3D, the first region-based approach that achieved real-time frame rates (20–25 Hz) by relying heavily on GPGPU acceleration. Here, pose estimation is performed using a pixel-wise gradient-based optimization scheme similar to the variational approach suggested in . But instead of separately integrating over the foreground and background region, PWP3D uses a level-set pose embedding in a cost function similar to the very early methods above in order to simplify computations and make it real-time capable. Additionally, based on the idea presented in , the previously proposed pixel-wise likelihoods were exchanged for pixel-wise posterior probabilities which have been shown to provide a wider basin of convergence.
There have been several successive works recently that build upon the general concept of PWP3D. These mostly address two main potential improvements of the original algorithm. The first is the first-order gradient descent used for pose optimization, which involves four different fixed step sizes that have to be adjusted experimentally for each model individually. It also uses a fixed number of iterations to maintain real-time performance and thus suffers from the commonly related convergence issues. This optimization was replaced in  with a second-order, so-called Gauss-Newton-like optimization where the Hessian matrix is approximated from first-order derivatives based on a linearized twist parametrization. This strategy vastly enhanced the convergence properties, resulting in significantly increased robustness towards fast rotations and scale changes. Furthermore, by performing this optimization in a hierarchical coarse-to-fine manner, the overall runtime of the proposed mainly CPU-based implementation (it only uses OpenGL for rendering) was reduced to achieve frame rates of 50–100 Hz for a single object on a commodity laptop. However, in  the optimization strategy was discovered from empirical studies and thus not properly derived analytically.
Another CPU-based approach was presented in  that achieves around 30 Hz on a mobile phone by using a hierarchical Levenberg-Marquardt optimization strategy for the translation parameters and approximating the level-set related computations. However, the main speed-up was enabled by including the phone’s gyroscope to obtain the rotation estimate, which is only corrected for drift every tenth frame by a single gradient descent step. Due to this sensor fusion the method presented in  can technically not be considered a monocular solution and is furthermore restricted to application scenarios in which the phone moves around a static object.
The second main disadvantage of the original PWP3D method is the rather simple segmentation model based on global foreground and background color histograms which is prone to fail in cluttered scenes. Therefore the authors of  introduce a boundary constraint for improvement, that is however not real-time capable. In  based on the idea presented in , a localized segmentation model was proposed that also relies on pixel-wise posteriors but uses multiple local color histograms to better capture spatial variations of the objects. However in  this approach was neither evaluated for pose tracking in video sequences nor shown to be real-time capable.
The latest work on region-based pose estimation  presents a real-time capable implementation of  in combination with the ideas of  and further extends the segmentation model by introducing temporal consistency and a pose detection strategy to recover from tracking losses. The resulting algorithm currently achieves state of the art real-time tracking performance by using the Gauss-Newton-like optimization and a segmentation model based on so-called temporally consistent local color histograms.
The Gauss-Newton-like optimization was recently also adopted in  which directly builds upon . Here an extended cost function with respect to the depth modality of an RGB-D device is derived in order to improve both the robustness of the pose estimation and the object segmentation in cluttered environments. To obtain the object pose it is suggested to combine the Gauss-Newton-like approach for the RGB-based term with a straightforward Gauss-Newton strategy for the depth-based term in a joint optimization.
Object Pose Tracking Datasets – There are several different aspects that can potentially be covered by an object pose tracking dataset. One important feature is the type and complexity of motion included in the respective image sequences. Here, the type of motion refers to whether they include movement of only the camera, only the object or both simultaneously. This is particularly important because it has a direct impact on e.g. the intensity gradients (i.e. shading) inside the object region, which depend on the lighting in the scene, and on how quickly the background changes. Another aspect is whether the light sources in the scene are moving. This is closely related to the case of object motion but typically has an even stronger impact on the object's appearance (e.g. self-shadowing). With regard to the previously discussed gradient-based image features, both well-textured and weakly textured objects should be included. Datasets can furthermore contain a single or multiple objects, occlusions, different amounts of background clutter and motion blur to simulate common problems in real scenarios.
Another essential question when generating an object tracking dataset is how the ground truth pose information is obtained for each frame, since this is basically a chicken-and-egg problem. One popular and straightforward approach is to render synthetic image sequences from artificial and thus fully controllable 3D scenes. However, the resulting images often lack photo-realism, and creating them requires sophisticated rendering techniques and expert knowledge in 3D animation. On the other hand, in case of real-data image sequences typically some sort of fiducial markers are placed in the scene to provide a reference pose from all perspectives independent of the rest of the scene. This requires the relative pose between markers and object to remain static at all times. Datasets of this kind thus often only contain motion of the camera unless markers are also attached to the object, which however changes its appearance in an unnatural and undesired way.
In the context of 6DOF object pose tracking a third, sort of intermediate strategy can be applied, where semi-synthetic image sequences are created, combining advantages of both previous solutions. Here, animated renderings of a realistically textured 3D model are composed with a real image sequence providing for a background in each frame (see Section 5.2 for details). This technique has been used for the Rigid Pose dataset presented in , which we consider most closely related to the one we propose. Here semi-synthetic stereo image pair sequences of six different objects are provided, each available in a noise-free, a noisy and an occluded version. However five out of the six objects used within this dataset are particularly well textured. Also the objects are rendered using a Lambertian illumination model without including any directional light source, meaning that the intensity of corresponding pixels between frames does not change. These renderings are furthermore simply pasted onto real images without e.g. blurring their contours in order to smooth the transition between the object and the background (e.g. Figure 2, top left).
Then there is the RGB-D Object Pose Tracking dataset of  that contains two real-data and four fully synthetic sequences including four different objects that are static in the scene. However ground truth information is only provided for the four synthetic sequences which were primarily designed for depth-based tracking. Here, apart from the respective object itself the rest of the scene is very simple and completely texture-less, which is why the resulting RGB color images look very artificial overall (e.g. Figure 2, top right).
Very recently the OPT dataset was presented in , being the most complex 6DOF object pose tracking dataset yet. It contains multiple real-data RGB-D sequences of six 2D patterns and six 3D objects that vary in texture and complexity. The images were captured with a camera mounted on a robot arm that moves around each single object at different speeds and under varying lighting conditions. The dataset thereby even covers scenarios with a moving light source and contains motion blur. Despite its complexity the data does not include object motion, background clutter or occlusions, since in all sequences the object is placed statically in front of an entirely white background surrounded by a passive black and white marker pattern (e.g. Figure 2, bottom left).
Lastly there is the dataset of , that contains six real-data RGB-D sequences involving six different objects and partial occlusions. In each sequence multiple static instances of the same object are placed on a cluttered table, each surrounded by a marker pattern (e.g. Figure 2, bottom right). Therefore the dataset also only contains movement of the camera. Although this dataset does provide test sequences of consecutive video frames, it was primarily designed for the task of 6DOF object pose detection. In that case the object pose is supposed to be recovered from only a single image as opposed to an image sequence in case of pose tracking. Other pose detection datasets (e.g. ) typically do not contain any consecutive frames at all and can therefore not be used for pose tracking, although the two tasks are actually strongly related.
To summarize, there are currently only a few datasets that have been created explicitly for the task of monocular 6DOF pose tracking. Of those available, [19, 3, 30] are relatively small and do not cover many of the initially mentioned aspects. The most complex data set currently available  unfortunately also does not simulate scenarios in which both the object and the camera are moving (e.g. a hand-held object in a cluttered environment), which we are targeting in this work.
In this work, we derive a cost function for estimating the 6DOF pose of a familiar 3D object observed through a monocular RGB camera. Our region-based approach involves a statistical image segmentation model, built from multiple overlapping local image regions along the contour of the object in each frame. At the core of this model are temporally consistent local color histograms (tclc-histograms) computed within these image regions, each of which is anchored to a unique location on the object’s surface and can thus be updated with each new frame.
While traditionally such cost functions have been optimized by means of gradient descent, we derive a suitable Gauss-Newton optimization scheme. This is not straight-forward since the cost function is not in the traditional nonlinear least-squares form. In numerous experiments, we demonstrate the effectiveness of our approach in different complex scenarios, including motion of both the camera and the objects, partial occlusions, strong lighting changes and cluttered backgrounds in comparison to the previous state of the art.
This work builds upon two prior conference publications [31, 32]. It expands these works on several levels: Firstly, we propose a systematic derivation of a Gauss-Newton optimization by means of reformulating the optimization problem as a reweighted nonlinear least-squares problem. This further improves the convergence rate and thus the tracking robustness significantly compared to the previous Gauss-Newton-like scheme of . Secondly, we explain how our method using tclc-histograms can be extended to multi-object tracking and thereby handle strong mutual occlusions in cluttered scenes, and we demonstrate its potential for real-time mixed-reality applications. Thirdly, we propose a novel large semi-synthetic 6DOF object pose tracking dataset that covers most of the previously mentioned important aspects. In our opinion this closes a gap in the current literature on object pose tracking datasets since it is the first to simulate the common scenario in which the camera and the objects are moving simultaneously under challenging conditions.
The rest of the article is structured as follows: Section 2 presents the derivation of the cost function as well as the involved statistical segmentation model based on tclc-histograms. The systematic derivation of the Gauss-Newton optimization for pose estimation of this cost function is given in Section 3. This is followed by implementation details in Section 4 and the introduction of our dataset in Section 5 where also an extensive experimental evaluation of our approach is provided. The article concludes with Section 6 and the acknowledgements in Section 7.
2 Temporally Consistent Local Color Histograms for Pose Tracking
We begin by giving an overview of basic mathematical concepts and the notation used within this work. Then we will derive the region-based cost function based on tclc-histograms. To simplify notation, we first consider the case of tracking a single object and then extend it to multiple objects.
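In this work we represent each object by a 3D model in form of a triangle mesh with vertices $\mathbf{X}_n := (X_n, Y_n, Z_n)^\top \in \mathbb{R}^3$, $n = 1, \dots, N$. We denote a camera RGB color image by $I : \Omega \to \{0, \dots, 255\}^3$, assuming 8-bit quantization per intensity channel. The color at a pixel location $\mathbf{x} := (x, y)^\top \in \Omega \subset \mathbb{R}^2$ in the 2D image plane is then given by $\mathbf{y} = I(\mathbf{x})$. By projecting a 3D model into the image plane we obtain a binary silhouette mask denoted by $I_s : \Omega \to \{0, 1\}$ that yields a contour $C$ splitting the image into a foreground region $\Omega_f \subset \Omega$ and a background region $\Omega_b = \Omega \setminus \Omega_f$ (see Figure 3).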
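The pose of an object, describing the rigid body transform from its 3D model coordinate frame to the camera coordinate frame, is represented by a homogeneous $4 \times 4$ matrix

$$T = \begin{bmatrix} R & \mathbf{t} \\ \mathbf{0}_{1 \times 3} & 1 \end{bmatrix} \in \mathbb{SE}(3),$$

being an element of the Lie-group $\mathbb{SE}(3)$. We assume the intrinsic parameters of the camera to be known from an offline pre-calibration step. By denoting the linear projection parameters in form of a $3 \times 3$ matrix

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

and assuming that all images have been rectified by removing lens distortion, we describe the projection of a 3D model surface point $\mathbf{X}$ into the image plane by

$$\mathbf{x} = \pi\big(K (T \tilde{\mathbf{X}})_{3 \times 1}\big),$$

with $\pi(\mathbf{X}) = (X/Z, Y/Z)^\top$. Here, the tilde-notation marks the homogeneous representation $\tilde{\mathbf{X}} = (X, Y, Z, 1)^\top$ of the point $\mathbf{X} = (X, Y, Z)^\top$.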
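The projection described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `project`, the intrinsic values and the pose are hypothetical and not taken from the paper. It applies the rigid transform to the homogeneous point, multiplies by the intrinsic matrix and divides by depth.

```python
def project(K, T, X):
    """Project a 3D model point X into the image: x = pi(K (T X~)_{3x1})."""
    Xh = [X[0], X[1], X[2], 1.0]  # homogeneous representation of X
    # Transform into the camera frame and keep the first three components.
    Xc = [sum(T[r][c] * Xh[c] for c in range(4)) for r in range(3)]
    # Apply the linear projection parameters.
    u = [sum(K[r][c] * Xc[c] for c in range(3)) for r in range(3)]
    if u[2] <= 0.0:
        raise ValueError("point is not in front of the camera")
    return (u[0] / u[2], u[1] / u[2])  # pi: divide by the depth component


# Hypothetical intrinsics and pose, chosen only for illustration.
K = [[500.0, 0.0, 320.0],
     [0.0, 500.0, 240.0],
     [0.0, 0.0, 1.0]]
T = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 2.0],  # object placed 2 units in front of the camera
     [0.0, 0.0, 0.0, 1.0]]
x = project(K, T, (0.1, -0.2, 0.0))
```

For the values above the point lands at pixel (345, 190), i.e. slightly right of and above the principal point.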
For pose tracking we denote a time-discrete sequence of images by $I_{t_k}$. Each image is captured at time $t_k$, $k = 0, \dots, l$, with $I_{t_l}$ being the current live image. Accordingly, we compute the trajectory of an object by estimating a sequence of rigid body transformations $T(t_k)$, $k = 0, \dots, l$, each corresponding to the related video frame. By assuming that the pose $T(t_{l-1})$ in the previous frame is known, we perform pose tracking in form of a so-called recursive pose estimation. For this we express the current live pose as $T(t_l) = \Delta T \, T(t_{l-1})$. Here $\Delta T$ is the pose difference that occurred between the last and the current frame. For a new live image we thus always only need to compute the remaining $\Delta T$ in order to obtain the current live pose $T(t_l)$, as long as we do not lose tracking.
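For pose optimization, we model the rigid body motion $\Delta T$ between $T(t_{l-1})$ and $T(t_l)$ with twists

$$\hat{\xi} = \begin{bmatrix} \hat{\boldsymbol{\omega}} & \mathbf{v} \\ \mathbf{0}_{1 \times 3} & 0 \end{bmatrix} \in \mathfrak{se}(3), \quad \hat{\boldsymbol{\omega}} \in \mathfrak{so}(3), \ \mathbf{v} \in \mathbb{R}^3,$$

being elements of the Lie-algebra $\mathfrak{se}(3)$ corresponding to the Lie-group $\mathbb{SE}(3)$. Each twist is parametrized by a six-dimensional vector of so-called twist coordinates

$$\xi = (\omega_1, \omega_2, \omega_3, v_1, v_2, v_3)^\top \in \mathbb{R}^6$$

and the matrix exponential

$$\exp(\hat{\xi}) \in \mathbb{SE}(3)$$

maps a twist to its corresponding rigid body transformation. For detailed information on Lie groups and Lie algebras please refer to e.g. 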
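As an illustration of how twist coordinates map to a rigid body transformation, the following sketch evaluates the matrix exponential in closed form (Rodrigues' formula for the rotation part). It assumes the ordering of three rotational coordinates followed by three translational ones; the helper names `hat`, `matmul` and `exp_twist` are hypothetical and not part of the paper.

```python
import math


def hat(w):
    """3x3 skew-symmetric matrix of a 3-vector."""
    return [[0.0, -w[2], w[1]],
            [w[2], 0.0, -w[0]],
            [-w[1], w[0], 0.0]]


def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def exp_twist(xi):
    """Map twist coordinates (w1, w2, w3, v1, v2, v3) to a 4x4 rigid transform."""
    w, v = xi[:3], xi[3:]
    theta = math.sqrt(sum(c * c for c in w))
    W = hat(w)
    W2 = matmul(W, W)
    if theta < 1e-9:
        # Small-angle limits of the three coefficients below.
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    I = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    # Rotation part (Rodrigues) and the matrix V that maps v to the translation.
    R = [[I[i][j] + a * W[i][j] + b * W2[i][j] for j in range(3)] for i in range(3)]
    V = [[I[i][j] + b * W[i][j] + c * W2[i][j] for j in range(3)] for i in range(3)]
    t = [sum(V[i][j] * v[j] for j in range(3)) for i in range(3)]
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]
```

A twist with zero rotational part yields a pure translation, while a twist such as `[0, 0, math.pi / 2, 1, 0, 0]` produces a quarter turn about the z-axis combined with a translation that follows the corresponding screw motion.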
2.2 The Region-based Cost Function
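Our approach is essentially based on statistical image segmentation . As usual in this context, we represent the object’s silhouette implicitly by a so-called shape-kernel $\Phi$. This is a level-set embedding of the object’s shape such that the zero-level line gives its contour, i.e. the boundary between $\Omega_f$ and $\Omega_b$. Here, we use the shape-kernel

$$\Phi(\mathbf{x}) = \begin{cases} -d(\mathbf{x}, C) & \forall \mathbf{x} \in \Omega_f \\ \phantom{-}d(\mathbf{x}, C) & \forall \mathbf{x} \in \Omega_b \end{cases}, \quad \text{with} \quad d(\mathbf{x}, C) = \min_{\mathbf{c} \in C} |\mathbf{c} - \mathbf{x}|$$

being the Euclidean distance between a pixel position $\mathbf{x}$ and the contour $C$ of the binary silhouette mask $I_s$.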
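For two-region segmentation, the corresponding level-set formulation is given by

$$P(\Phi \mid I) = \prod_{\mathbf{x} \in \Omega} \Big( H_e(\Phi(\mathbf{x})) P_f(\mathbf{x}) + \big(1 - H_e(\Phi(\mathbf{x}))\big) P_b(\mathbf{x}) \Big). \tag{9}$$

It describes the posterior probability of the shape kernel $\Phi$ given an image $I$, with $H_e$ being a smoothed Heaviside step function. Here, $P_f(\mathbf{x})$ and $P_b(\mathbf{x})$ represent the per-pixel foreground and background region membership probability, based on the underlying statistical appearance models (see Section 2.3). In the general context of 2D region-based image segmentation, the closed curve $C$ would be evolved in an unconstrained manner such that it maximizes $P(\Phi \mid I)$ and thus the discrepancy between the foreground and background appearance model statistics. In our scenario, however, the evolution of the object's contour is constrained by the known shape prior in form of a 3D model. Therefore the shape kernel only depends on the pose parameters, i.e. $\Phi(\mathbf{x}) = \Phi(\mathbf{x}(\xi))$. Assuming pixel-wise independence and taking the negative log of (9), we obtain the region-based cost function

$$E(\xi) = -\sum_{\mathbf{x} \in \Omega} \log \Big( H_e(\Phi(\mathbf{x}(\xi))) P_f(\mathbf{x}) + \big(1 - H_e(\Phi(\mathbf{x}(\xi)))\big) P_b(\mathbf{x}) \Big). \tag{10}$$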
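This function can be optimized with respect to the twist coordinates $\xi$ for pose estimation based on 2D-to-3D shape matching. In our approach we define the Heaviside function explicitly as

$$H_e(\Phi(\mathbf{x})) = \frac{1}{\pi} \Big( -\operatorname{atan}\big(s \cdot \Phi(\mathbf{x})\big) + \frac{\pi}{2} \Big),$$

with $s$ determining the slope of the smoothed transition (see Section 4 for details).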
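To make the role of the smoothed Heaviside function in the cost more concrete, here is a minimal sketch. It assumes an arctangent-based smoothing and the sign convention that the level-set is negative inside the object and positive outside; the slope value `s=1.2` is an arbitrary placeholder, and the function names are hypothetical.

```python
import math


def heaviside(phi, s=1.2):
    """Smoothed step: close to 1 inside the object (phi < 0), close to 0 outside."""
    return (-math.atan(s * phi) + math.pi / 2.0) / math.pi


def region_cost(phi, p_f, p_b, s=1.2):
    """Negative log-posterior summed over pixels.

    phi, p_f and p_b are equally long sequences holding, per pixel, the signed
    distance to the contour and the foreground/background membership probabilities.
    """
    total = 0.0
    for d, pf, pb in zip(phi, p_f, p_b):
        h = heaviside(d, s)
        total -= math.log(h * pf + (1.0 - h) * pb)
    return total


# Two pixels: the first looks like foreground, the second like background.
p_f, p_b = [0.9, 0.1], [0.1, 0.9]
aligned = region_cost([-5.0, 5.0], p_f, p_b)     # silhouette agrees with the statistics
misaligned = region_cost([5.0, -5.0], p_f, p_b)  # silhouette contradicts them
```

In this toy setting the aligned silhouette yields a clearly lower cost than the misaligned one, which is the behaviour the pose optimization exploits.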
2.3 Statistical Segmentation Model
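In the past, different appearance models have been proposed to compute $P_f(\mathbf{x})$ and $P_b(\mathbf{x})$ used in (10). Initially  used a global appearance model based on the color distribution in both $\Omega_f$ and $\Omega_b$. Here each region has its own model, denoted by $M_f$ for the foreground, i.e. the object, and $M_b$ for the background. Each of them is represented with a global color histogram. Based on this, the region membership probabilities are calculated in form of pixel-wise posteriors as

$$P_i(\mathbf{x}) = \frac{P(\mathbf{y} \mid M_i)}{\eta_f P(\mathbf{y} \mid M_f) + \eta_b P(\mathbf{y} \mid M_b)}, \quad i \in \{f, b\},$$

where $\eta_f = \sum_{\mathbf{x} \in \Omega} H_e(\Phi(\mathbf{x}))$ and $\eta_b = \sum_{\mathbf{x} \in \Omega} \big(1 - H_e(\Phi(\mathbf{x}))\big)$.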
This model is also used within PWP3D  where it is further suggested to keep the appearance models temporally consistent, in order to adapt to scene changes while tracking the object. Having successfully estimated the current live pose $T(t_l)$ allows rendering a corresponding silhouette mask denoted by $I_s(t_l)$. Ideally, this mask provides an exact segmentation of the object region in the current camera frame and can thus be used in order to compute up-to-date color histograms $P^{t_l}(\mathbf{y} \mid M_f)$ and $P^{t_l}(\mathbf{y} \mid M_b)$. Instead of always using the latest color distribution for the appearance models,  suggested recursively adjusting the histograms by

$$P(\mathbf{y} \mid M_i) \leftarrow \alpha_i P^{t_l}(\mathbf{y} \mid M_i) + (1 - \alpha_i) P(\mathbf{y} \mid M_i), \quad i \in \{f, b\}, \tag{14}$$

to prevent them from being corrupted by occlusions or pose estimation inaccuracies. Here, $\alpha_f$ and $\alpha_b$ denote foreground and background learning rates.
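The global appearance model can be sketched as follows. This is a hedged illustration: it assumes the common pixel-wise posterior form in which the color likelihoods of both models are normalized by a term weighted with the sizes of the two regions, and it assumes that the learning rate weights the newly observed histogram. The function names and the numbers are hypothetical.

```python
def pixelwise_posteriors(lik_f, lik_b, eta_f, eta_b):
    """Return (P_f, P_b) for one pixel.

    lik_f / lik_b: likelihood of the pixel color under the foreground /
    background histogram; eta_f / eta_b: (soft) pixel counts of both regions.
    """
    denom = eta_f * lik_f + eta_b * lik_b
    if denom == 0.0:
        return 0.0, 0.0  # color not observed in either histogram
    return lik_f / denom, lik_b / denom


def update_histogram(old, new, alpha):
    """Blend a freshly computed normalized histogram into the running model."""
    return [alpha * n + (1.0 - alpha) * o for o, n in zip(old, new)]


p_f, p_b = pixelwise_posteriors(lik_f=0.8, lik_b=0.2, eta_f=100.0, eta_b=300.0)
blended = update_histogram([1.0, 0.0], [0.0, 1.0], alpha=0.1)
```

With a small learning rate the running histogram changes only slightly per frame, so a single frame with occlusions or a slightly wrong pose cannot overwrite the model.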
All the above is based on the assumption that the global color distribution is sufficiently descriptive in order to distinguish between the foreground and the background region. Therefore, this appearance model has been shown to work particularly well with homogeneous objects of a distinct color that is not dominantly present in the rest of the scene. However, for objects with heterogeneous surfaces and in case of cluttered scenes this global model is prone to fail.
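Hence, in  a localized appearance model was proposed for the PWP3D approach that better captures spatial variations of the object’s surface. The idea is to build the segmentation model from multiple overlapping circular image regions along the object’s contour as originally introduced in . We denote each such local region by $\Omega_n$ with radius $r$, centered at pixel $\mathbf{x}_n \in C$. Now, $C$ splits each $\Omega_n$ into a foreground region $\Omega_{f_n} \subset \Omega_n$ and a background region $\Omega_{b_n} \subset \Omega_n$ (see Figure 3). This allows computing local foreground and background color histograms for each region. In  this led to the localized cost function

$$E = -\frac{1}{N} \sum_{n=1}^{N} \sum_{\mathbf{x} \in \Omega} \log \Big( H_e(\Phi(\mathbf{x})) P_{f_n}(\mathbf{x}) + \big(1 - H_e(\Phi(\mathbf{x}))\big) P_{b_n}(\mathbf{x}) \Big) B_n(\mathbf{x}), \tag{15}$$

using the masking function

$$B_n(\mathbf{x}) = \begin{cases} 1 & \forall \mathbf{x} \in \Omega_n \\ 0 & \forall \mathbf{x} \notin \Omega_n \end{cases},$$

which indicates whether a pixel $\mathbf{x}$ lies within a local region or not. Here the local region membership probabilities $P_{f_n}(\mathbf{x})$ and $P_{b_n}(\mathbf{x})$ are computed individually from the local histograms as

$$P_{i_n}(\mathbf{x}) = \frac{P(\mathbf{y} \mid M_{i_n})}{\eta_{f_n} P(\mathbf{y} \mid M_{f_n}) + \eta_{b_n} P(\mathbf{y} \mid M_{b_n})}, \quad i \in \{f, b\},$$

in analogy to the global model. In , however, temporal consistency of the local appearance models was not addressed. The local region centers $\mathbf{x}_n$ were calculated as arbitrary sets of pixel locations along $C$ for each image. Thus, this approach in general does not allow establishing correspondences of the centers across multiple frames (i.e. $\mathbf{x}_n(t_{l-1}) \to \mathbf{x}_n(t_l)$), which is required in order to update the respective histograms.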
This issue has been addressed in  by introducing a segmentation model based on temporally consistent local color histograms (tclc-histograms), which we adopt in this work. Here each 3D model vertex $\mathbf{X}_n$ is associated with a local foreground and a local background histogram (see Figure 4). In contrast to  this allows us to compute the histogram centers by projecting all model vertices into the image plane, i.e. $\mathbf{x}_n = \pi\big(K (T \tilde{\mathbf{X}}_n)_{3 \times 1}\big)$, and selecting the subset of all $\mathbf{x}_n \in C$. Since each histogram is anchored to the object's surface, center correspondences between frames are simply given by the projection of corresponding surface points, i.e. $\mathbf{x}_n(t_{l-1}) \to \mathbf{x}_n(t_l)$. This ensures that the individual histograms are kept temporally consistent. Whenever a model vertex projects onto the contour for the first time, its corresponding histograms are initialized from the local region around its center in the current frame. Otherwise, if its histograms already contain information from a previous frame, we update them as

$$P(\mathbf{y} \mid M_{i_n}) \leftarrow \alpha_i P^{t_l}(\mathbf{y} \mid M_{i_n}) + (1 - \alpha_i) P(\mathbf{y} \mid M_{i_n}), \quad i \in \{f, b\},$$

in analogy to (14).
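The bookkeeping behind temporally consistent histograms can be illustrated with a small sketch. It assumes that histograms are stored per vertex index in a dictionary, that a histogram is created the first time its vertex projects onto the contour, and that it is blended recursively afterwards; `update_tclc` and the learning rate value are hypothetical.

```python
def update_tclc(histograms, vertex_id, new_hist, alpha):
    """Initialize the histogram of a vertex on first contour contact, else blend.

    histograms: dict mapping a vertex index to its normalized color histogram.
    new_hist:   histogram computed from the local region in the current frame.
    """
    old = histograms.get(vertex_id)
    if old is None:
        # First time this vertex lies on the contour: take the observation as is.
        histograms[vertex_id] = list(new_hist)
    else:
        # Vertex was seen before: recursive update with learning rate alpha.
        histograms[vertex_id] = [alpha * n + (1.0 - alpha) * o
                                 for o, n in zip(old, new_hist)]
    return histograms[vertex_id]


foreground = {}
first = update_tclc(foreground, 42, [0.5, 0.5], alpha=0.1)
second = update_tclc(foreground, 42, [1.0, 0.0], alpha=0.1)
```

Because the key is the vertex index rather than a pixel location, the same histogram is found again in later frames regardless of where the vertex projects to.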
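In  it has furthermore been shown that computing the average energy (15) over all local regions potentially suffers from the same segmentation problems locally as the previous approach based on the global appearance model. More robust results can be obtained by instead computing the average posteriors from all local histograms as

$$\bar{P}_i(\mathbf{x}) = \frac{1}{\sum_{n=1}^{N} B_n(\mathbf{x})} \sum_{n=1}^{N} P_{i_n}(\mathbf{x}) B_n(\mathbf{x}), \quad i \in \{f, b\},$$

and using these within (10). We can now define the energy function

$$E(\xi) = -\sum_{\mathbf{x} \in \Omega} \log \Big( H_e(\Phi(\mathbf{x}(\xi))) \bar{P}_f(\mathbf{x}) + \big(1 - H_e(\Phi(\mathbf{x}(\xi)))\big) \bar{P}_b(\mathbf{x}) \Big) \tag{21}$$

that we use for our pose tracking approach based on tclc-histograms.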
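A sketch of the averaging step for a single pixel follows. It assumes circular regions of a common radius, treats a pixel as inside a region if its distance to the center is strictly smaller than the radius (whether the boundary counts is an assumption), and returns `None` for pixels covered by no region. The function name is hypothetical.

```python
def average_posterior(pixel, centers, radius, local_posteriors):
    """Average the local posteriors of all circular regions that contain `pixel`.

    centers:          list of (x, y) region centers on the contour.
    local_posteriors: posterior of this pixel under each region's histograms.
    """
    total, count = 0.0, 0
    for n, (cx, cy) in enumerate(centers):
        # Masking function: is the pixel inside region n?
        inside = (pixel[0] - cx) ** 2 + (pixel[1] - cy) ** 2 < radius ** 2
        if inside:
            total += local_posteriors[n]
            count += 1
    return total / count if count else None


centers = [(0, 0), (3, 0), (100, 100)]
p_bar = average_posterior((1, 0), centers, radius=4, local_posteriors=[0.2, 0.6, 0.9])
```

Here only the first two regions cover the pixel, so the third posterior does not influence the average.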
2.4 Extension to Using Multiple Objects
In the following, we will explain how our approach easily extends to tracking multiple objects simultaneously. For this each object is represented by its own 3D model, with corner vertices $\mathbf{X}_n^i$, $n = 1, \dots, N_i$, $i = 1, \dots, M$, where $M$ is the total number of objects. Accordingly, the individual poses are denoted by $T_i$. Projecting all models into the image plane yields a common segmentation mask $I_s$. It contains contours $C_i$ that split the image into multiple foreground regions $\Omega_{f_i}$ and background regions $\Omega_{b_i} = \Omega \setminus \Omega_{f_i}$ (see Figure 5).
For pose tracking we optimize a separate energy function for each object, denoted by

$$E_i(\xi_i) = -\sum_{\mathbf{x} \in \Omega} \log \Big( H_e(\Phi_i(\mathbf{x}(\xi_i))) \bar{P}_{f_i}(\mathbf{x}) + \big(1 - H_e(\Phi_i(\mathbf{x}(\xi_i)))\big) \bar{P}_{b_i}(\mathbf{x}) \Big),$$

with its own level-set $\Phi_i$ and region membership probabilities $\bar{P}_{f_i}$ and $\bar{P}_{b_i}$ computed from its individual set of tclc-histograms. Here $N_i$ denotes the number of vertices per model. Each such optimization is performed regardless of the other objects as long as they do not occlude each other. However, in cases of mutual occlusions, the foreground regions overlap, which results in contour segments that do not belong to the actual silhouette of the objects (see again Figure 5). These cases must be detected and handled appropriately during pose optimization as explained in detail in Section 4.4.
This shows that for all formulas, extending our approach to multiple objects is essentially done by adding the object index $i$. For the sake of clarity we will drop $i$ again in the rest of this article, unless absolutely required.
3 Pose Optimization
Traditionally, cost functions of form (10) have been optimized using gradient descent (see e.g.  or ). In comparison to second order (Newton) methods, this has several drawbacks. First, one has to determine suitable time step sizes associated with translation and rotation. Too small step sizes often lead to very slow convergence, too large step sizes easily induce oscillations and instabilities. For the gradient descent-based PWP3D method , for example, one needs to manually specify the number of iterations and three different step sizes (one each for rotation, translation along the optical axis and translation within the camera’s image plane). These need to be adapted at least once for each new object. Moreover, as shown in the exemplary comparison in Figure 8, often no suitable compromise between numerical stability and convergence to the desired solution can be achieved. Second, convergence is typically not as robust and rather slow (especially near the optimum), making the technique less suitable for accurate and robust real-time tracking.
Applying a second-order optimization scheme, on the other hand, is not straightforward because the cost function (10) is not in the classical form of a nonlinear least-squares problem. In the following, we will propose a strategy to circumvent this issue, based on rewriting the original problem in form of a re-weighted nonlinear least-squares estimation. This is different from an (in our view less straightforward) derivation proposed by Bibby and Reid , which requires a Taylor series approximation of the square root.
3.1 Derivation of a Gauss-Newton Strategy
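The cost function in (21) can be written compactly as

$$E(\xi) = \sum_{\mathbf{x} \in \Omega} F(\mathbf{x}, \xi), \quad F(\mathbf{x}, \xi) = -\log \Big( H_e(\Phi(\mathbf{x}(\xi))) \bar{P}_f(\mathbf{x}) + \big(1 - H_e(\Phi(\mathbf{x}(\xi)))\big) \bar{P}_b(\mathbf{x}) \Big).$$

Unfortunately, this is not in the traditional form of a nonlinear least-squares estimation problem for which the Gauss-Newton algorithm is applicable. However, we can simply rewrite this expression as a nonlinear weighted least-squares problem of the form

$$E(\xi) = \frac{1}{2} \sum_{\mathbf{x} \in \Omega} \psi(\mathbf{x}) F^2(\mathbf{x}, \xi), \quad \psi(\mathbf{x}) = \frac{1}{F(\mathbf{x}, \xi)},$$

which agrees with the original cost up to the constant factor $\frac{1}{2}$ and therefore has the same minimizer. To optimize this cost function, one can apply the technique of iteratively reweighted least-squares estimation, which amounts to solving the above problem for fixed weights $\psi(\mathbf{x})$ by means of Gauss-Newton optimization and alternatingly updating the weights $\psi(\mathbf{x})$. Over the iterations, these weights will adaptively reweight the respective terms.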
In the fixed-weight assumption, the energy gradient is given by
and the Hessian is given by
The Gauss-Newton algorithm emerges when applying a Newton method and dropping the second-order derivative of the residual . This approximation is valid if either the residual itself is small (i.e. near the optimum) or if the residuum is close to linear (in which case ). If we denote the Jacobian of the residuum at the current pose by
under the above assumptions, the second order Taylor approximation of the cost function is given by
This leads to the optimal Gauss-Newton update step of
We apply this step as composition of the matrix exponential of the corresponding twist with the previous pose as
in order to remain within the group .
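To illustrate the reweighting idea in isolation, the following Python sketch applies a reweighted Gauss-Newton update to a scalar toy problem. It is not the tracker's cost function: the per-term cost `F = 1 + (theta - c)^2`, the weight choice `psi = 1 / F` and all function names are assumptions made purely for illustration.

```python
def reweighted_gauss_newton(centers, theta, iterations=20):
    """Minimise E(theta) = sum_i F_i(theta), with the toy cost
    F_i(theta) = 1 + (theta - c_i)^2 > 0, by treating it as the weighted
    least-squares problem 1/2 * sum_i psi_i * F_i^2 with psi_i = 1 / F_i."""
    for _ in range(iterations):
        grad = 0.0  # sum_i psi_i * F_i * J_i
        hess = 0.0  # sum_i psi_i * J_i^2 (second derivative of F_i dropped)
        for c in centers:
            F = 1.0 + (theta - c) ** 2  # positive per-term cost (toy stand-in)
            J = 2.0 * (theta - c)       # dF/dtheta at the current estimate
            psi = 1.0 / F               # weight, held fixed during this step
            grad += psi * F * J
            hess += psi * J * J
        if hess == 0.0:
            break                       # all Jacobians vanish, nothing to update
        theta -= grad / hess            # Gauss-Newton step
    return theta


theta_hat = reweighted_gauss_newton([0.0, 2.0, 4.0], theta=5.0)
```

For this toy cost the minimiser is the mean of the centers, so `theta_hat` ends up close to 2. No step size has to be tuned, which is the practical difference from gradient descent discussed above.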
3.2 Computation of the Derivatives
The per-pixel Jacobian term (28) is computed by applying the chain rule as
where is the smoothed Dirac delta function corresponding to , i.e.
Since , the derivatives of the signed distance transform are given by
Assuming small motion, we perform piecewise linearization of the matrix exponential in each iteration, i.e. and therefore we get
with . For pixels , we choose to be the 3D surface point in the camera’s frame of reference that projects to its closest contour pixel . Finally, we compute the derivatives of with respect to a pixel as 2D image gradients, utilizing central differences, i.e.
In order to increase the convergence speed, the iterative pose optimization is performed in a hierarchical coarse-to-fine manner. This also makes the tracking more robust to fast movement and motion blur. Details on our concrete multi-level implementation are given in Section 4.2.
During successful tracking, for every new frame the optimization starts at the previously estimated pose . Note that also the histogram centers i.e. the regions are obtained from and remain unchanged during the entire iterative optimization process. They provide the information used to compute and from the intensities in current frame.
To start tracking or to recover from tracking loss, our approach can be combined with a pose detection module in order to obtain the initial pose. As shown in previous work, with the help of manual initialization, tclc-histograms can act as an object descriptor for pose detection based on template matching. Due to the employed temporal consistency strategy, this descriptor is trained online within a couple of seconds by tracking the object and showing it to the camera from different perspectives. This approach is particularly efficient for recovering from temporary tracking loss, e.g. caused by massive occlusion or by the object leaving the camera’s field of view. Another advantage of this approach is that it can be computed at frame rates of 4–10 Hz for a single object on a commodity laptop CPU. However, the tclc-histogram-based descriptors struggle in previously unseen environments if the foreground and background color distributions differ too much from the scene they were originally trained on.
When more computational power is available, recent deep learning-based approaches (see e.g. [22, 11]) could also be used for pose detection. These currently achieve state-of-the-art results and can be trained solely from synthetic images [11, 10], which makes them robust to different environments and lighting conditions. However, they require a powerful GPU in order to achieve frame rates similar to those of our method based on tclc-histograms running on a CPU.
In the following we provide an overview of our C++ implementation with regard to runtime performance aspects. We perform all major processing steps in parallel on the CPU and only use the GPU via OpenGL for rendering purposes.
4.1 Rendering Engine
We use the standard rasterization pipeline of OpenGL in order to obtain the silhouette masks . Since we want to process the rendered images on the CPU, we perform offscreen rendering into a framebuffer object, which is afterwards downloaded to host memory. To generate synthetic views that match the real images, the intrinsic parameters (2) of the camera need to be included. For this, we model the transformation from 3D model coordinates to homogeneous coordinates within the canonical view volume of OpenGL as . Here, is the so-called look-at matrix
that aligns the principal axes of the real camera’s coordinate frame with those of the virtual OpenGL camera and is a homogeneous projection matrix
with respect to the camera matrix . The scalars , are the width and height of the real image and , are the near- and far-plane of the view frustum described by .
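As a rough illustration of how intrinsics can be folded into an OpenGL-style projection, the Python sketch below builds a look-at matrix and a projection matrix under one common convention and checks them against the pinhole model. The matrix entries, the axis conventions (vision camera looking along +z with y down, continuous pixel coordinates with the origin at the top-left image corner) and all function names are assumptions. They are not necessarily the exact matrices used in our implementation.

```python
def look_at_cv_to_gl():
    """Flip the y and z axes: the vision camera looks along +z with y down,
    the OpenGL camera looks along -z with y up (assumed convention)."""
    return [[1.0, 0.0, 0.0, 0.0],
            [0.0, -1.0, 0.0, 0.0],
            [0.0, 0.0, -1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]


def projection_from_intrinsics(fx, fy, cx, cy, width, height, near, far):
    """Map OpenGL camera coordinates to clip coordinates for a pinhole camera."""
    return [[2.0 * fx / width, 0.0, 1.0 - 2.0 * cx / width, 0.0],
            [0.0, 2.0 * fy / height, 2.0 * cy / height - 1.0, 0.0],
            [0.0, 0.0, -(far + near) / (far - near), -2.0 * far * near / (far - near)],
            [0.0, 0.0, -1.0, 0.0]]


def mat_vec(m, v):
    """Multiply a 4x4 matrix with a 4-vector."""
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]


def project_to_pixel(point_cam, fx, fy, cx, cy, width, height, near, far):
    """Push a camera-frame point through look-at and projection, then to pixels."""
    gl = mat_vec(look_at_cv_to_gl(), list(point_cam) + [1.0])
    proj = projection_from_intrinsics(fx, fy, cx, cy, width, height, near, far)
    clip = mat_vec(proj, gl)
    ndc = [c / clip[3] for c in clip[:3]]   # perspective division
    u = (ndc[0] + 1.0) * width / 2.0
    v = (1.0 - ndc[1]) * height / 2.0       # image rows grow downwards
    return u, v, ndc[2]


# Hypothetical intrinsics and image size, chosen only for the check below.
u, v, z_ndc = project_to_pixel((0.1, -0.05, 0.8),
                               650.0, 650.0, 330.0, 250.0, 640, 512, 0.1, 10.0)
```

The resulting pixel agrees with the direct pinhole projection `u = fx * X / Z + cx`, `v = fy * Y / Z + cy`, which is a simple way to validate any chosen sign convention before rendering.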
In case of tracking multiple objects, all 3D models are rendered in the same scene. Each mesh is rendered with a constant and unique color that corresponds to its model index . This allows us to separate the individual foreground regions and identify their contours as required for computing the different level-sets. Here, mutual occlusions are natively handled by the OpenGL z-buffer.
As seen in (35), the derivatives used for pose optimization involve the coordinates of the 3D surface point in the camera’s frame of reference, corresponding to each pixel . In addition to the silhouette mask, we therefore also download the z-buffer into a per-pixel depth map . Given , the required coordinates are efficiently determined via backprojection as , with
where is the homogeneous representation of an image point .
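The backprojection step can be sketched in Python as follows. The sketch assumes a standard perspective projection, the default OpenGL depth range [0, 1] and a camera matrix without skew. The function names and the numeric values are hypothetical.

```python
def buffer_value(depth, near, far):
    """Forward mapping, used here only for checking:
    metric depth -> depth-buffer value in [0, 1]."""
    z_ndc = (far + near) / (far - near) - 2.0 * far * near / ((far - near) * depth)
    return 0.5 * (z_ndc + 1.0)


def metric_depth(value, near, far):
    """Invert the non-linear depth-buffer encoding to obtain metric depth."""
    z_ndc = 2.0 * value - 1.0
    return 2.0 * near * far / (far + near - z_ndc * (far - near))


def backproject(u, v, depth, fx, fy, cx, cy):
    """Apply the inverse camera matrix to the homogeneous pixel (u, v, 1)
    and scale the resulting ray by the metric depth."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


stored = buffer_value(0.8, 0.1, 10.0)            # what the depth buffer would hold
depth = metric_depth(stored, 0.1, 10.0)          # recovered metric depth
point = backproject(411.25, 209.375, depth, 650.0, 650.0, 330.0, 250.0)
```

With these example values the recovered point is approximately (0.1, -0.05, 0.8), i.e. the 3D point whose projection is the given pixel.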
It has been shown that it is beneficial to consider not only the points on the surface closest to the camera but also the most distant ones (on the backside of the object) for pose optimization. In order to obtain the respective coordinates for each pixel, we compute an additional reverse depth map , for which we simply invert the OpenGL depth check used to compute the corresponding z-buffer (see Figure 6). Given , the farthest surface point corresponding to a pixel is also recovered as .
We perform pose optimization hierarchically within a three-level image pyramid generated with a down-scale factor of 2. The third level thereby corresponds to the camera matrix , the second to , and the first to . In our current real-time implementation, we first perform four iterations on the third level, followed by two iterations on the second and finally a single iteration on the first level, i.e. the original full image resolution. In case of multiple objects, all poses are updated sequentially once per iteration.
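The coarse-to-fine schedule can be summarised in a few lines of Python. The sketch assumes that the camera matrix of a pyramid level is obtained by scaling focal lengths and principal point by the down-scale factor, ignoring half-pixel offsets. The function and variable names as well as the example intrinsics are hypothetical.

```python
def scale_intrinsics(fx, fy, cx, cy, halvings):
    """Intrinsics for an image that was down-scaled `halvings` times by a factor of 2."""
    s = 1.0 / (2 ** halvings)
    return (fx * s, fy * s, cx * s, cy * s)


# (pyramid level, number of halvings, iterations): level 3 is the coarsest,
# level 1 is the original full image resolution.
SCHEDULE = [(3, 2, 4), (2, 1, 2), (1, 0, 1)]


def iteration_plan(fx, fy, cx, cy):
    """Expand the schedule into one (level, intrinsics) entry per iteration."""
    plan = []
    for level, halvings, iterations in SCHEDULE:
        intrinsics = scale_intrinsics(fx, fy, cx, cy, halvings)
        plan.extend([(level, intrinsics)] * iterations)
    return plan


plan = iteration_plan(650.0, 650.0, 320.0, 256.0)
```

The plan contains seven entries per frame, four on the coarsest level, two on the middle level and one at full resolution.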
Each iteration starts by rendering and downloading the common silhouette mask and depth map as well as individual reverse depth maps based on the current pose estimates . To distinguish multiple objects, we render each model silhouette region using a unique intensity corresponding to the model index . Here, hierarchical rendering is achieved by simply adjusting the width and height of the OpenGL viewport according to the current pyramid level. Next, the individual signed distance transforms are computed from . For this we have implemented the efficient two-pass algorithm of Felzenszwalb and Huttenlocher in parallel on the CPU. Here, the first pass runs in parallel for each row and the second pass for each column of pixels. In addition to the distance value, we also store the 2D coordinates of the closest contour point to every pixel . This is required for obtaining the corresponding 3D surface point needed to calculate the derivatives of (35) with respect to a background pixel.
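A minimal, sequential Python sketch of such a two-pass distance transform is given below. It follows the lower-envelope-of-parabolas idea and additionally records the closest seed pixel. It computes unsigned squared distances to a set of seed (contour) pixels. Applying the sign for inside and outside, the parallel execution and all names are left out or assumed for illustration.

```python
import math

BIG = 1e12  # large finite cost standing in for "no contour pixel here"


def dt_1d(f):
    """For every q return min over p of (q - p)^2 + f[p] and the minimising p."""
    n = len(f)
    v = [0] * n                  # positions of the parabolas in the lower envelope
    z = [0.0] * (n + 1)          # boundaries between neighbouring parabolas
    z[0], z[1] = -math.inf, math.inf
    k = 0
    for q in range(1, n):
        while True:
            p = v[k]
            s = ((f[q] + q * q) - (f[p] + p * p)) / (2.0 * (q - p))
            if s <= z[k]:
                k -= 1           # parabola p is hidden by q, remove it
            else:
                break
        k += 1
        v[k] = q
        z[k], z[k + 1] = s, math.inf
    dist, arg = [0.0] * n, [0] * n
    k = 0
    for q in range(n):
        while z[k + 1] < q:
            k += 1
        dist[q] = (q - v[k]) ** 2 + f[v[k]]
        arg[q] = v[k]
    return dist, arg


def distance_transform(seeds):
    """seeds[r][c] is True on contour pixels. Returns squared distances and,
    per pixel, the (row, col) of a closest contour pixel."""
    rows, cols = len(seeds), len(seeds[0])
    row_d, row_arg = [], []
    for r in range(rows):        # pass 1: every row is independent
        d, a = dt_1d([0.0 if seeds[r][c] else BIG for c in range(cols)])
        row_d.append(d)
        row_arg.append(a)
    dist = [[0.0] * cols for _ in range(rows)]
    closest = [[(0, 0)] * cols for _ in range(rows)]
    for c in range(cols):        # pass 2: every column is independent
        d, a = dt_1d([row_d[r][c] for r in range(rows)])
        for r in range(rows):
            dist[r][c] = d[r]
            closest[r][c] = (a[r], row_arg[a[r]][c])
    return dist, closest


seeds = [[False] * 6 for _ in range(5)]
seeds[1][1] = True
seeds[3][4] = True
dist, closest = distance_transform(seeds)
```

Because each row in the first pass and each column in the second pass is processed independently, both loops can be distributed over threads without synchronisation.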
Finally, the Hessian and the gradient of the energy needed for the parameter step (30) are calculated in parallel for each row of pixels on the CPU. Here, each thread calculates its own sums over and which are finally added up in the main thread. Following PWP3D, for each pixel we add both the Jacobian terms with respect to the coordinates of as well as . For a further speed-up, we exploit the fact that the Hessian is symmetric, meaning that we only have to calculate its upper triangular part. The update step is then computed using Cholesky decomposition.
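The accumulation and solve can be sketched in Python as follows. Per-row partial sums stand in for the per-thread sums, only the upper triangle of the approximate Hessian is accumulated, and the step is obtained from a Cholesky factorisation. The data layout `(psi, F, J)` per pixel and all names are assumptions, and the example uses two parameters instead of six to keep the check readable.

```python
import math


def accumulate_normal_equations(rows_of_pixels, n):
    """Sum psi * J^T J (upper triangle only) and psi * F * J^T over all pixels.
    Each pixel is a tuple (psi, F, J) with J a list of n floats."""
    H = [[0.0] * n for _ in range(n)]
    g = [0.0] * n
    for pixels in rows_of_pixels:            # one partial sum per image row
        H_part = [[0.0] * n for _ in range(n)]
        g_part = [0.0] * n
        for psi, F, J in pixels:
            for i in range(n):
                g_part[i] += psi * F * J[i]
                for j in range(i, n):        # upper triangle only
                    H_part[i][j] += psi * J[i] * J[j]
        for i in range(n):                   # reduction of the partial sums
            g[i] += g_part[i]
            for j in range(i, n):
                H[i][j] += H_part[i][j]
    for i in range(n):                       # mirror to obtain the full matrix
        for j in range(i):
            H[i][j] = H[j][i]
    return H, g


def cholesky_solve(H, b):
    """Solve H x = b for a symmetric positive definite H via H = L L^T."""
    n = len(b)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = H[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    y = [0.0] * n
    for i in range(n):                       # forward substitution: L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):             # back substitution: L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


rows = [[(1.0, 1.0, [1.0, 0.0])], [(1.0, 1.0, [1.0, 2.0])]]
H, g = accumulate_normal_equations(rows, 2)
step = [-x for x in cholesky_solve(H, g)]
```

Cholesky decomposition requires the accumulated matrix to be positive definite, so a degenerate configuration with too few contributing pixels would need to be handled separately.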
In our current implementation we choose within the Heaviside function (11) regardless of the pyramid level. We therefore only ever need to perform pose optimization, i.e. compute the derivatives of each cost function, within a band of px around each contour, i.e. with (see Figure 7). For other distances the corresponding Dirac delta value becomes very small. Since scales all other derivatives per pixel (see (32)), those outside this narrow band have a negligible influence on the overall optimization. This further allows us to restrict the processed pixels to a 2D ROI (region of interest) containing this contour band for an additional speed-up. We obtain this ROI by computing the 2D bounding rectangle of the projected bounding box of a model, expanded by 8 pixels in each direction. This is done efficiently on the CPU without performing a full rendering.
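The effect of the narrow band can be made concrete with a small Python sketch. It assumes an arctangent-based smoothed Heaviside function with slope `s`, for which the sign convention, the example slope value `S = 1.2` and all names are hypothetical and are not taken from (11). Only the expansion of the bounding rectangle by 8 pixels follows the description above.

```python
import math


def smoothed_heaviside(phi, s):
    """Arctangent-based smoothed step function (assumed form)."""
    return 0.5 + math.atan(s * phi) / math.pi


def smoothed_dirac(phi, s):
    """Derivative of the smoothed step function with respect to phi."""
    return s / (math.pi * (1.0 + (s * phi) ** 2))


def expand_roi(box, margin, width, height):
    """Grow a bounding rectangle (x0, y0, x1, y1) by `margin` pixels,
    clipped to the image."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - margin), max(0, y0 - margin),
            min(width - 1, x1 + margin), min(height - 1, y1 + margin))


S = 1.2  # hypothetical slope value, for illustration only
relative_weight = smoothed_dirac(8.0, S) / smoothed_dirac(0.0, S)
roi = expand_roi((5, 10, 100, 50), 8, 640, 512)
```

Under these assumptions a pixel 8 px away from the contour contributes with roughly 1 % of the weight of a pixel on the contour, which is why pixels outside the band can be skipped.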
Due to the multi-scale strategy, it can easily happen that an object region only projects to a small number of pixels at higher pyramid levels when the object is far from the camera. The derivatives computed from so few pixels are typically less reliable and thus often move the optimization in the wrong direction. To counter this effect, at the beginning of each iteration we compute the area of the 2D bounding rectangle and check whether it is too small at the current pyramid level (we use 3000 pixels as lower bound for an image resolution of px). If this is the case, we directly move to the next higher image resolution in the pyramid and perform the optimization iteration there.
We use the RGB color model with a quantization of 32 bins per channel to represent the tclc-histograms. The key idea to efficiently build and update the localized appearance model is to process each histogram region in parallel on the CPU using Bresenham circles to scan the corresponding pixels. When updating the tclc-histograms we use learning rates of and , allowing fast adaptation to dynamic changes. Based on the results presented in , we choose the histogram region radius as px for an image resolution of px, regardless of the object’s distance to the camera.
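The binning and the temporal update can be sketched in Python as follows. The 32 bins per channel follow the description above. The exponential blending rule, the learning rate `alpha = 0.1` and all names are assumptions for illustration, not the exact update of our implementation.

```python
BINS = 32    # bins per colour channel
SHIFT = 3    # 256 / 32 = 8 = 2**3 intensity values per bin


def bin_index(r, g, b):
    """Flat index of the histogram bin for an 8-bit RGB colour."""
    return ((r >> SHIFT) * BINS + (g >> SHIFT)) * BINS + (b >> SHIFT)


def normalized_histogram(pixels):
    """Sparse, normalised colour histogram of a list of (r, g, b) pixels."""
    counts = {}
    for r, g, b in pixels:
        i = bin_index(r, g, b)
        counts[i] = counts.get(i, 0) + 1
    total = float(len(pixels))
    return {i: c / total for i, c in counts.items()}


def blend(old, new, alpha):
    """Move the stored histogram towards the current one with learning rate alpha."""
    keys = set(old) | set(new)
    return {k: (1.0 - alpha) * old.get(k, 0.0) + alpha * new.get(k, 0.0)
            for k in keys}


previous = normalized_histogram([(0, 0, 0), (255, 255, 255)])
current = normalized_histogram([(8, 0, 0)])
updated = blend(previous, current, alpha=0.1)  # hypothetical learning rate
```

Because both inputs are normalised, the blended histogram remains normalised for any learning rate between 0 and 1.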
The reason why we are using a fixed radius is that, for continuous tracking, we can only compute the segmentation within the histogram regions belonging to the silhouette in the previous frame. Thus, in cases of fast translational movement of the object or rotation of the camera, it is possible that the object in the current frame projects entirely outside the previous histogram regions. This becomes more likely for smaller histogram radii. Therefore, if the radius were to shrink with the distance to the camera, the object would be more likely to get lost at far distances. Non-overlapping histogram regions in case of close distances and sparse surface sampling hardly influence the reliability of our approach.
The other extreme case is when the object is so far away that all histograms overlap. Here the discriminability of the appearance model is reduced since all pixels then lie within all histograms. The segmentation model then acts like the global approach constrained to a local region around the silhouette extended by the histogram radius. However, this is still better than the global model and, in our experience, preferable to the increased risk of easily losing the object with smaller radii.
After each pose optimization we compute the new 2D histogram centers by projecting all mesh vertices of each model onto pixels . In practice we consider those with as well as (we use ), in order to ensure that the contour is evenly sampled. For runtime reasons, since this can lead to a large number of histograms that have to be updated, we randomly pick a maximum of 100 centers per frame. This Monte Carlo approach requires the mesh vertices to be uniformly distributed across the 3D model in order to evenly cover all regions. They should also be limited in number to ensure that all histograms will get updated regularly. We therefore use two different 3D mesh representations for the model. The original mesh is used to render exact silhouette views regardless of the mesh structure, while a reduced (we use a maximum of 5000 vertices) and evenly sampled version of the model is used for computing the 2D centers of the histograms.
4.4 Occlusion Handling
As previously shown in Figure 5, when tracking multiple objects simultaneously, mutual occlusions are very likely to emerge. These must be handled appropriately on a per-pixel level for pose optimization. In our approach, occlusions can be detected with the help of the common silhouette mask due to the OpenGL z-buffer. Thus, the respective contours computed directly from , can contain segments resulting from occlusions that are considered in the respective signed distance transform. To handle this, for each object all pixels with a distance value that was influenced by occlusion have to be discarded for pose optimization (see Figure 7).
A straightforward approach, as realized in previous work, is to render each model’s silhouette separately as well as the common silhouette mask . The signed distance transforms are then computed from the non-occluded where is only used to identify whether a pixel belongs to a foreign object region and thus has to be discarded.
Although this strategy is easy to compute it does not scale well with the number of objects since , and have to be rendered and transferred to host memory in each iteration. In order to minimize rendering and memory transfer we follow the approach of . Thus, we instead render the entire scene once per iteration and download the common silhouette mask and the respective depth-buffer . The individual level-sets are then directly computed from . In addition to this we only have to render each model once separately in order to obtain the individual reverse depth buffers . This is not possible in a common scene rendering because the reverse depths of the occluded object would overwrite those of the object in front.
By only using and the detection of pixels with a distance value that was influenced by occlusion is split into two cases. For a pixel outside of the silhouette region i.e. , we start by checking whether equals another object index. If so, is discarded if also the depth at of the other object is smaller than that of the closest contour pixel to , meaning that the other surface is actually in front of the current object (indicated with dark red in Figure 7). For inside of the silhouette region i.e. we perform the same checks for all neighboring pixels outside of to the closest contour pixel to . If any of these pixels next to the contour passes the mask and depth checks, is discarded (indicated with bright red in Figure 7).
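The two-case check can be sketched in Python as follows. The sketch assumes that the common mask stores 0 for background and the model index otherwise, that contour pixels belong to the object itself, and that a 4-neighbourhood is inspected around the closest contour pixel. These choices and all names are assumptions made for illustration.

```python
def occluded_by_other(mask, depth, pixel, contour_pixel, own_index):
    """True if `pixel` shows a foreign object that is closer to the camera
    than the own surface at `contour_pixel`."""
    r, c = pixel
    cr, cc = contour_pixel
    other = mask[r][c]
    return other not in (0, own_index) and depth[r][c] < depth[cr][cc]


def discard_pixel(mask, depth, pixel, closest_contour, own_index):
    """Decide whether the distance value at `pixel` was influenced by occlusion."""
    r, c = pixel
    if mask[r][c] != own_index:
        # Case 1: pixel outside the own silhouette, test the pixel itself.
        return occluded_by_other(mask, depth, pixel, closest_contour, own_index)
    # Case 2: pixel inside the own silhouette, test the outside neighbours
    # of its closest contour pixel.
    cr, cc = closest_contour
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = cr + dr, cc + dc
        inside_image = 0 <= nr < len(mask) and 0 <= nc < len(mask[0])
        if inside_image and mask[nr][nc] != own_index:
            if occluded_by_other(mask, depth, (nr, nc), closest_contour, own_index):
                return True
    return False


INF = float("inf")
mask = [[0, 1, 1, 2, 2]]               # object 1 next to object 2
depth = [[INF, 1.0, 1.0, 0.5, 0.5]]    # object 2 is in front of object 1
```

In this one-row example, the right-hand boundary of object 1 is caused by object 2 in front of it, so pixels whose closest contour pixel is `(0, 2)` are discarded, whereas pixels associated with the genuine contour at `(0, 1)` are kept.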
5 Experimental Evaluation
In the following we present both quantitative and qualitative results of our approach in several different experiments. We start with an exemplary comparison between first-order and second-order optimization. This is followed by a comprehensive evaluation of tracking success rates in our novel dataset as well as complex mixed reality application examples. For all of these experiments we evaluate our implementation on a laptop with an Intel Core i7 quad core CPU @ 2.8 GHz and an AMD Radeon R9 M370X GPU.
5.1 Comparison to PWP3D
In previous work, the advantages of the Gauss-Newton-like second-order strategy in comparison to the first-order gradient descent method of the original PWP3D were demonstrated. Essentially, the true Gauss-Newton optimization proposed here has convergence properties similar to those of the previous Gauss-Newton-like one. We therefore include an exemplary experiment from that work in order to illustrate the general difference between first-order and second-order optimization in this context (see Figure 8). Note that both approaches use the same appearance model based on global color histograms.
In the selected experiment a cordless screwdriver was tracked while being moved in front of a stationary camera. The results show that suitable step sizes in PWP3D depend on the distance to the camera. If the distance between object and camera becomes too small, the step sizes are too large and the pose starts to oscillate. Conversely, if this distance increases, the overall optimization quality degrades because the step sizes become too small for the optimization to converge.
The sequence contains a challenging full turn around the -axis of the screwdriver (e.g. frames 180–400). For this the rotation step-size for PWP3D was set to a large value such that it was close to oscillating (e.g. frames 105–150) since this produced the best overall results. While the Gauss-Newton-like strategy is able to correctly track the entire motion, PWP3D fails to determine the rotation of the object despite the large step-size for rotation. Starting at around frame 450, the screwdriver was moved closer towards the camera, leading to a tracking loss of PWP3D at frame 586, while the proposed method remained stable.
Due to these essential drawbacks we did not include PWP3D in the quantitative evaluation within our new complex dataset in Section 5.3. It would require manually setting three different step sizes individually for each object, and we did not see a chance of it performing competitively.
5.2 The RBOT Dataset
We call the proposed semi-synthetic monocular 6DOF pose tracking dataset RBOT (Region-based Object Tracking) in reference to the proposed method. We have made it publicly available for download at http://cvmr.info/research/RBOT. It comprises a total of eighteen different objects, all available as textured 3D triangle meshes. In addition to our own model of a squirrel clay figurine, we have included a selection of twelve models from the LINE-MOD dataset and five from the Rigid Pose dataset, as shown in Figure 9.
For each model we have generated four variants of a semi-synthetic image sequence with increasing complexity (see Figure 10). The first regular variant contains a single object rendered with respect to a static point light source located above the origin of the virtual camera. This simulates a moving object in front of a static camera. The second variant is the same as the first but with a dynamic light source in order to simulate simultaneous motion of both the object and the camera. The images in the third variant were also rendered with respect to the moving light source and further distorted by adding artificial Gaussian noise. Finally, the fourth variant contains an additional second object (the squirrel) which orbits around the first object and thus frequently occludes it. These multi-object sequences also include the dynamic light source.
Regardless of the object and the variant, we animated the model using the same pre-defined trajectory of continuous 6DOF motion in all sequences. We also always use the same background video for compositing. This video was taken by moving a hand-held camera arbitrarily in a cluttered desktop scene. In order to increase realism, we rendered the models with anti-aliasing and blurred the object regions in the composition using a Gaussian kernel. The latter in particular smooths the transition between the object and the background, blending it more realistically with the rest of the scene. Each sequence contains 1001 RGB frames of px resolution, where the first frame is always used for initialization. This results in a total of 72,072 color images (18 objects × 4 variants × 1001 frames). For each frame we provide the ground truth poses for the two objects as well as the intrinsic camera matrix used for rendering, which we obtained by calibrating the camera that recorded the background video.
We denote the sequence of ground truth poses by , composed of and with . Starting at , for each subsequent frame we compute the tracking error separately for translation and rotation
for each object to evaluate the tracking success rate. If is below 5 cm and below , we consider the pose to be successfully tracked. Otherwise, if one of the errors is not within its boundaries, we consider the tracking to be lost and reset it to the ground truth pose, i.e., . In our experiments we evaluated the multi-object sequences in two different ways. We either track only the pose of the first (varying) object or both of them (the varying object and the occluding squirrel). When tracking only one of the objects, the occurring occlusions are unmodelled as we call it here. These are much harder to handle than modelled occlusions, where the pose and geometry of the occluding object is known, in case of tracking both objects.
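One common way to compute such errors is sketched below in Python: the Euclidean distance between the translation vectors and the angle of the relative rotation between estimated and ground truth rotation matrices. Whether these are exactly the error measures used above cannot be read off this section, so the formulas, the function names and the rotation threshold in the example are assumptions. Only the 5 cm translation bound is taken from the text.

```python
import math


def translation_error(t, t_gt):
    """Euclidean distance between estimated and ground truth translation."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t, t_gt)))


def rotation_error_deg(R, R_gt):
    """Angle (in degrees) of the relative rotation R^T R_gt."""
    # trace(R^T R_gt) equals the sum of element-wise products of R and R_gt
    trace = sum(R[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard against round-off
    return math.degrees(math.acos(cos_angle))


def is_tracked(t, R, t_gt, R_gt, max_translation, max_rotation_deg):
    """A pose counts as tracked only if both errors are within their bounds."""
    return (translation_error(t, t_gt) < max_translation
            and rotation_error_deg(R, R_gt) < max_rotation_deg)


angle = math.radians(10.0)
R_est = [[math.cos(angle), -math.sin(angle), 0.0],
         [math.sin(angle), math.cos(angle), 0.0],
         [0.0, 0.0, 1.0]]
R_true = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
e_t = translation_error((0.03, 0.04, 0.0), (0.0, 0.0, 0.0))   # metres
e_r = rotation_error_deg(R_est, R_true)
# 0.05 m corresponds to the 5 cm bound; the 20 degree value is a placeholder.
ok = is_tracked((0.01, 0.0, 0.0), R_est, (0.0, 0.0, 0.0), R_true, 0.05, 20.0)
```

The success rate of a sequence is then the fraction of frames for which `is_tracked` holds, with the tracker being reset to the ground truth pose after each failure.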
5.3 Discussion of the Results
In Table I we present the tracking success rates of the proposed method and the method presented in  for all sequences in all variants. The results show that the novel re-weighted Gauss-Newton optimization outperforms the previous Gauss-Newton-like strategy in most cases by a large margin. The experiments also demonstrate the robustness of the appearance model based on tclc-histograms towards a moving light source. For both methods, compared to the regular scenario, the performance often even improves in the dynamic light variant and hardly deteriorates otherwise. However, it can also be seen that both approaches perform significantly worse for objects with ambiguous silhouettes (e.g. Baking Soda, Glue and Koala Candy) and struggle more with image noise in case of objects of a less distinct color (e.g. Camera, Can, Cube, Egg Box etc.).
5.4 Applications to Mixed Reality
Due to its low runtime and high accuracy, our approach is very suitable for mixed reality applications. The ability to track multiple objects enables immersive visualizations in highly dynamic scenarios. Here, the virtual augmentations can realistically be occluded by the real objects since their geometry and poses are accurately known (see Figure 11). What makes it even more attractive for mixed reality systems is that our approach can handle fair amounts of unmodelled occlusion, e.g. by hands. Therefore, object-specific augmentations will remain in place while a user inspects objects by manipulating them manually. This further allows arbitrary objects to be turned into 6DOF motion input devices.
With this work we have closed two gaps in the literature on 6DOF object pose tracking. Firstly, we have provided a fully analytic derivation of a Gauss-Newton optimization that was lacking in the previous Gauss-Newton-like approach. It is derived in the form of a reweighted nonlinear least-squares estimation. A systematic quantitative evaluation in Table I shows that the resulting update scheme leads to significant improvements over that approach. Secondly, we have created and presented a novel large dataset for object tracking that covers practically relevant scenarios beyond prior comparable work. We believe that the community will benefit from both of these contributions.
However, regardless of the employed optimization strategy the presented approach only relies on the objects’ contours to determine their poses. It is therefore prone to fail for objects with ambiguous silhouette projections such as bodies of revolution (e.g. Baking Soda and Koala Candy from the dataset). To cope with this restriction and further improve tracking accuracy a photometric term could be incorporated in the cost function with regard to the objects’ texture. Assuming an object is sufficiently textured this should resolve the silhouette ambiguity in many cases.
Part of this work was funded by the Federal Ministry for Economic Affairs and Energy (BMWi). We thank Stefan Hinterstoisser and Karl Pauwels for letting us re-use their 3D models for our dataset. DC was supported by the ERC Consolidator Grant 3DReloaded.
-  C. Bibby and I. D. Reid. Robust real-time visual tracking using pixel-wise posteriors. In Proceedings of the European Conference on Computer Vision (ECCV), 2008.
-  T. Brox, B. Rosenhahn, J. Gall, and D. Cremers. Combined region and motion-based 3d tracking of rigid and articulated objects. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 32(3):402–415, 2010.
-  C. Choi and H. I. Christensen. RGB-D object tracking: A particle filter approach on GPU. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2013.
-  D. Cremers, M. Rousson, and R. Deriche. A review of statistical approaches to level set segmentation: Integrating color, texture, motion and shape. International Journal of Computer Vision (IJCV), 72(2):195–215, 2007.
-  S. Dambreville, R. Sandhu, A. J. Yezzi, and A. Tannenbaum. Robust 3d pose estimation and efficient 2d region-based segmentation from a 3d shape prior. In Proceedings of the European Conference on Computer Vision (ECCV), 2008.
-  P. F. Felzenszwalb and D. P. Huttenlocher. Distance transforms of sampled functions. Theory of Computing, 8(1):415–428, 2012.
-  J. Hexner and R. R. Hagege. 2d-3d pose estimation of heterogeneous objects using a region based approach. International Journal of Computer Vision (IJCV), 118(1):95–112, 2016.
-  S. Hinterstoisser, S. Benhimane, and N. Navab. N3M: natural 3d markers for real-time object detection and pose estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV)