Depth-Based Bayesian Object Tracking Library
We consider the problem of model-based 3D-tracking of objects given dense depth images as input. Two difficulties preclude the application of a standard Gaussian filter to this problem. First of all, depth sensors are characterized by fat-tailed measurement noise. To address this issue, we show how a recently published robustification method for Gaussian filters can be applied to the problem at hand. Thereby, we avoid using heuristic outlier detection methods that simply reject measurements if they do not match the model. Secondly, the computational cost of the standard Gaussian filter is prohibitive due to the high-dimensional measurement, i.e. the depth image. To address this problem, we propose an approximation to reduce the computational complexity of the filter. In quantitative experiments on real data we show how our method clearly outperforms the standard Gaussian filter. Furthermore, we compare its performance to a particle-filter-based tracking method, and observe comparable computational efficiency and improved accuracy and smoothness of the estimates.
As a robot interacts with its surroundings, it needs to continuously estimate its own and the environment’s state. Only then can this estimate provide timely feedback to low-level controllers or motion planners to choose the appropriate next movement. More concretely, for purposeful and robust manipulation of an object, it is crucial for the robot to know the location of its own manipulator and the target object. This information is not directly observable but has to be inferred from noisy sensory data.
In this paper, we address the problem of continuously inferring the 6-degree-of-freedom pose and velocity of an object from dense depth images. A 3D mesh model of the considered object is assumed to be known. Sensors yielding depth images, such as the Microsoft Kinect or the Asus Xtion, are widespread within robotics and provide a great amount of information about the state of the environment. However, working with such sensors can be challenging due to their noise characteristics (e.g. presence of outliers) and the high dimensionality of the measurement. These difficulties have thus far precluded the direct application of Gaussian filtering methods to image measurements.
The most well-known members of the family of Gaussian filters (GFs) are the Extended Kalman Filter (EKF) and the Unscented Kalman Filter (UKF).
While GFs are the preferred estimation method for low-level sensors, they have rarely been applied directly to image data. Instead, a large variety of different approaches has been applied to model-based 3D object tracking. A comprehensive overview is given by Lepetit and Fua. Some notable recent methods that rely on matching distinct features of the object model with features in the RGB measurement have been proposed in [8, 3]. With the availability of cheap RGB-D sensors such as the Microsoft Kinect or the Asus Xtion, the interest in the problem of object tracking from depth data has increased greatly. A number of methods minimize some energy function that captures the discrepancy between the current measurement and state estimate [18, 13, 21, 15]. Pauwels et al. combine energy minimization given different visual cues with global feature-based re-initialization of the tracker.
Hebert et al. provide an example of a GF-based method for object and manipulator tracking. However, this method does not process the depth data directly. Instead, features are extracted from the depth images and then passed to the GF as measurements.
Among filtering methods, particle filters (PFs) are used much more often for state estimation from cameras. The PF is well suited for two reasons. First of all, it is non-parametric, which makes it possible to model the fat-tailed noise that typically corrupts image measurements, see for instance [23, 5, 24]. Secondly, the heavy computational requirements can be offset by modeling each pixel as an independent sensor and by parallelization in the particles [5, 4].
These two properties of PFs do not hold for GFs. The standard GF fails completely for fat-tailed-distributed measurements. This is a serious problem since outliers are very common in image data due to occlusions and other effects which are difficult to model. To address this problem, we use a recently proposed robustification method for the GF. The resulting method is far more robust to occlusions and other unmodeled effects than a standard GF.
Furthermore, naive application of the GF is very inefficient for high-dimensional measurements, and parallelization is not as trivial as for the PF. We use an observation model which factorizes in the pixels, similar to [5, 24]. The proposed method exploits this structure to reduce the computational complexity and to allow for parallelization.
These results extend the domain of application of the GF to problems which have previously been outside of its range. In many applications, it is preferable to use a GF instead of a PF, since the latter suffers from particle deprivation for all but very low dimensional state spaces. This issue manifests itself as jumps in the estimates computed by the PF. This is especially problematic when the estimate is fed into a controller, e.g. for visual servoing.
We apply the proposed method to tracking the 6-degree-of-freedom pose and velocity of an object using only depth measurements from a consumer depth camera. The proposed algorithm runs at the frame rate of the camera (30 Hz) on just one CPU core. A quantitative comparison between the proposed method and a PF-based method is provided in the experimental section. We show that the proposed GF-based method yields smoother estimates.
The quantitative experimental analysis is based on an object tracking dataset which we recorded for this purpose. It consists of depth images recorded with an Asus Xtion camera, showing a person manipulating a set of rigid objects with different levels of occlusions and motion velocities. This dataset is annotated with the frame-by-frame ground-truth object pose as recorded with a Vicon motion capture system. This dataset as well as the implementation of the proposed method are publicly available.
In this section we briefly review filtering in general, and the GF in particular. This will serve as a basis for the derivation of the proposed algorithm in the following sections.
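Filtering is concerned with estimating the current state $x_t$ given all past measurements $y_{1:t}$. The posterior distribution of the current state $p(x_t \mid y_{1:t})$ can be computed recursively from the distribution of the previous state $p(x_{t-1} \mid y_{1:t-1})$. This recursion can be written in two steps, a prediction step
$$p(x_t \mid y_{1:t-1}) = \int p(x_t \mid x_{t-1}) \, p(x_{t-1} \mid y_{1:t-1}) \, dx_{t-1} \tag{1}$$
and an update step
$$p(x_t \mid y_{1:t}) = \frac{p(y_t \mid x_t) \, p(x_t \mid y_{1:t-1})}{\int p(y_t \mid x_t) \, p(x_t \mid y_{1:t-1}) \, dx_t}. \tag{2}$$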
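Since the prediction (1) does not admit a closed form solution for most nonlinear systems, the GF approximates the exact belief $p(x_t \mid y_{1:t-1})$ with a Gaussian distribution $q(x_t \mid y_{1:t-1}) = \mathcal{N}(x_t \mid \mu_x, \Sigma_{xx})$. The parameters are obtained by moment matching
$$\mu_x = \int x_t \, p(x_t \mid y_{1:t-1}) \, dx_t, \tag{3}$$
$$\Sigma_{xx} = \int (x_t - \mu_x)(x_t - \mu_x)^\top \, p(x_t \mid y_{1:t-1}) \, dx_t. \tag{4}$$
When there is no analytic solution to these equations, they can be approximated using numeric integration methods. Such methods are efficient in this setting since samples from $p(x_t \mid y_{1:t-1})$ can be generated easily using ancestral sampling: first we sample from the previous belief $p(x_{t-1} \mid y_{1:t-1})$, and then we sample from the process model $p(x_t \mid x_{t-1})$.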
Different numeric methods give rise to the different instances of the GF, such as the EKF, the UKF and the Cubature Kalman Filter (CKF). In all the experiments of this paper, we use the Unscented Transform (UT) for numeric integration. However, for the ideas in this paper, it is not relevant which particular integration method is used.
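The moment matching of the prediction step can be sketched with a basic unscented transform. This is a minimal illustration, not the implementation used in the paper: the function names, the additive process noise and the value `kappa=1.0` are assumptions made for the example.

```python
import numpy as np

def sigma_points(mean, cov, kappa=1.0):
    """Return the 2N+1 sigma points and weights of a basic unscented transform."""
    n = mean.shape[0]
    # Matrix square root of the scaled covariance (Cholesky factor).
    sqrt_cov = np.linalg.cholesky((n + kappa) * cov)
    points = [mean]
    for j in range(n):
        points.append(mean + sqrt_cov[:, j])
        points.append(mean - sqrt_cov[:, j])
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    weights[0] = kappa / (n + kappa)
    return np.array(points), weights

def predict(mean, cov, process_fn, noise_cov):
    """Moment-match the predicted belief (assumes additive process noise)."""
    points, weights = sigma_points(mean, cov)
    propagated = np.array([process_fn(p) for p in points])
    pred_mean = weights @ propagated
    centered = propagated - pred_mean
    pred_cov = (weights[:, None] * centered).T @ centered + noise_cov
    return pred_mean, pred_cov

# Hypothetical linear process: for linear models the transform is exact.
F = np.array([[1.0, 0.1], [0.0, 1.0]])
Q = 0.01 * np.eye(2)
m, P = predict(np.array([0.0, 1.0]), np.eye(2), lambda x: F @ x, Q)
```

For the linear example, `m` and `P` coincide with the Kalman prediction `F @ mean` and `F @ cov @ F.T + Q`, which is a convenient sanity check for any integration scheme.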
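In most practically relevant systems, the update step (2) does not admit an exact solution either. The approach taken here is different from the prediction step, because it is rarely possible to sample from the posterior (2) efficiently.
Therefore, the GF instead approximates the joint distribution $p(x_t, y_t \mid y_{1:t-1})$ with a Gaussian distribution [25, 20]
$$q(x_t, y_t \mid y_{1:t-1}) = \mathcal{N}\left( \begin{pmatrix} x_t \\ y_t \end{pmatrix} \,\middle|\, \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{xy}^\top & \Sigma_{yy} \end{pmatrix} \right). \tag{5}$$
The approximate posterior is then obtained by conditioning on the measurement $y_t$,
$$q(x_t \mid y_{1:t}) = \mathcal{N}\left(x_t \mid \mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y_t - \mu_y),\; \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{xy}^\top\right). \tag{6}$$
The parameters are obtained by minimizing the KL divergence between the exact and the approximate joint distributions
$$\mathrm{KL}\left[\, p(x_t, y_t \mid y_{1:t-1}) \,\|\, q(x_t, y_t \mid y_{1:t-1}) \,\right]. \tag{7}$$
The optimal $\mu_x$ and $\Sigma_{xx}$ are given by (3) and (4), and the remaining parameters are
$$\mu_y = \int y_t \, p(y_t \mid y_{1:t-1}) \, dy_t, \tag{8}$$
$$\Sigma_{yy} = \int (y_t - \mu_y)(y_t - \mu_y)^\top \, p(y_t \mid y_{1:t-1}) \, dy_t, \tag{9}$$
$$\Sigma_{xy} = \iint (x_t - \mu_x)(y_t - \mu_y)^\top \, p(x_t, y_t \mid y_{1:t-1}) \, dx_t \, dy_t. \tag{10}$$
These integrals can be approximated efficiently, since samples from the joint distribution $p(x_t, y_t \mid y_{1:t-1})$ can be generated easily. First we sample from the prediction distribution $q(x_t \mid y_{1:t-1})$, and then we sample from the observation model $p(y_t \mid x_t)$.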
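Once the moments of the joint Gaussian approximation are available, the update reduces to conditioning a Gaussian on the measurement. The following sketch shows that step in isolation; the function name `gf_update` and the argument names are illustrative assumptions, and the moments are taken as given rather than computed by numeric integration.

```python
import numpy as np

def gf_update(mu_x, cov_xx, mu_y, cov_yy, cov_xy, y):
    """Condition a joint Gaussian over (x, y) on an observed y."""
    # Solving against the M x M matrix cov_yy is the expensive step
    # when the measurement is high-dimensional.
    gain = np.linalg.solve(cov_yy, cov_xy.T).T   # equals cov_xy @ inv(cov_yy)
    post_mean = mu_x + gain @ (y - mu_y)
    post_cov = cov_xx - gain @ cov_xy.T
    return post_mean, post_cov

# Scalar toy example: x ~ N(0, 1), y = x + noise with unit variance.
post_mean, post_cov = gf_update(
    mu_x=np.array([0.0]), cov_xx=np.array([[1.0]]),
    mu_y=np.array([0.0]), cov_yy=np.array([[2.0]]),
    cov_xy=np.array([[1.0]]), y=np.array([2.0]),
)
```

In the toy example the posterior mean moves halfway toward the measurement and the variance is halved, as expected when prior and noise variances are equal.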
For the tracking problem considered in this paper, the computational bottleneck is the update step, where the high-dimensional measurement is incorporated.
In the update step, we first approximate the integrals (8), (9) and (10) by sampling from the Gaussian prediction $q(x_t \mid y_{1:t-1})$, as explained in the previous section. To sample from a Gaussian distribution requires computing the matrix square root of its covariance matrix (4). The computational cost of this operation is $O(N^3)$, where $N$ is the dimension of the state $x_t$.
Once the integrals are approximated, the posterior (6) is obtained by conditioning on the measurement $y_t$. This operation requires the inversion of the matrix $\Sigma_{yy} \in \mathbb{R}^{M \times M}$, where $M$ is the dimension of the observation. This leads to a cubic complexity in the dimension of the observation as well. (Footnote: Some matrix computations can be carried out with a slightly lower computational complexity than the $O(n^3)$ required by the naive, element-wise approach. For instance, the Coppersmith-Winograd algorithm for matrix multiplication is $O(n^{2.376})$. In practice, however, such algorithms only represent an advantage for very large matrices, and they would affect both the standard GF and our method the same way. For simplicity, we hence compute all computational complexities using standard matrix operations.)
Hence, the computational complexity of the update step in the state dimension $N$ and in the observation dimension $M$ is $O(N^3 + M^3)$.
In this paper, the state to be estimated is the position and the orientation of the tracked object. Since the orientation does not belong to a Euclidean space, we choose to represent the state of the system by a differential change of orientation $\delta\theta_t$ and position $\delta r_t$ with respect to a default position $r_0$ and orientation $\theta_0$. This default pose is updated to the mean of the latest belief after each filtering step. This way we make sure that the deviations $\delta r_t$ and $\delta\theta_t$ remain small and can therefore be treated as members of a Euclidean space.
Additionally, we estimate the linear velocity $v_t$ and angular velocity $\omega_t$ of the object. The complete state is then
$$x_t = \begin{pmatrix} \delta r_t \\ \delta\theta_t \\ v_t \\ \omega_t \end{pmatrix}.$$
The sensor used is a depth camera. The observation is a range image $y_t$ with pixels $y_{t,i}$, $i = 1, \dots, M$. A range image contains at each pixel the distance from the camera to the closest object. It can be thought of as an array of beams emitted by the camera, where at each beam the distance traveled until the first intersection with an object is measured (Figure 1).
We treat each pixel as an independent sensor; the observation model therefore factorizes as
$$p(y_t \mid x_t) = \prod_{i=1}^{M} p(y_{t,i} \mid x_t).$$
The observation we would expect at a given pixel is the distance $d_i(x_t)$ between the camera and the tracked object along the corresponding beam. This distance can be computed easily for a given state $x_t$, since we have a mesh model of the object. The observation could be modeled as the depth corrupted with some Gaussian noise,
$$b(y_{t,i} \mid x_t) = \mathcal{N}(y_{t,i} \mid d_i(x_t), \sigma^2),$$
where $\sigma^2$ is the covariance of the noise, which we assume to be equal for all pixels.
However, the assumption of Gaussian noise is clearly violated in depth images. The noise of the sensors is much more heavy-tailed. Even more importantly, there might be occlusions of the tracked object by other, unmodeled objects. The model above is not robust to such effects. Even a few occluded sensors can introduce huge estimation errors, as we show in Section VIII-A.
To address this problem, we introduce a second observation model, describing the case where the measurement is produced by such unmodeled effects. To express our ignorance about them, we choose the second observation model to be a uniform distribution $u(y_{t,i})$ over the range of the sensor, which is roughly 0.5–7 m for our depth sensor. The complete observation model is a weighted sum of the body model and the tail model
$$p(y_{t,i} \mid x_t) = (1 - w)\, b(y_{t,i} \mid x_t) + w\, u(y_{t,i}), \tag{14}$$
where $w$ is the probability of the measurement being produced by unmodeled effects.
This simple extension is enough to account for outliers and occlusion, as shown in the experimental section. Similar observation models have been used in PF-based approaches to motion estimation from range data [23, 5, 24]. While standard PFs work with fat-tailed observation models, this is not the case for the GF. In Section IV, we apply a recent result to enable the GF to work with this observation model.
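The per-pixel observation model, a Gaussian body mixed with a uniform tail over the sensor range, can be written down directly. The sketch below is only illustrative: the function name, the noise level of 5 mm and the tail weight of 0.1 are assumed values, not the parameters used in the experiments.

```python
import math

def pixel_likelihood(y, expected_depth, sigma, tail_weight, d_min=0.5, d_max=7.0):
    """Density of one depth pixel: Gaussian body mixed with a uniform tail."""
    body = math.exp(-0.5 * ((y - expected_depth) / sigma) ** 2) / (
        sigma * math.sqrt(2.0 * math.pi)
    )
    tail = 1.0 / (d_max - d_min) if d_min <= y <= d_max else 0.0
    return (1.0 - tail_weight) * body + tail_weight * tail

# An occluder at 0.6 m in front of an object expected at 1.0 m.
gaussian_only = pixel_likelihood(0.6, 1.0, sigma=0.005, tail_weight=0.0)
with_tail = pixel_likelihood(0.6, 1.0, sigma=0.005, tail_weight=0.1)
```

Under the purely Gaussian model the occluded pixel has essentially zero density, so it would pull the estimate strongly. With the tail, the same measurement keeps a small but non-negligible density and is explained as an outlier.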
The state transition model is a simple linear model. The velocity is perturbed by Gaussian noise at every time step. It is then integrated into the position. Analogously, the angular velocity is perturbed by noise and then integrated into the orientation.
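As a rough sketch of the translational part of this transition model, the velocity can be perturbed and then integrated over one time step. The function name, the explicit time step `dt` and the isotropic noise are assumptions made for the illustration; the rotational part would follow the same pattern.

```python
import numpy as np

def transition(position, velocity, dt, velocity_noise_std, rng):
    """Perturb the velocity with Gaussian noise, then integrate it into the position."""
    new_velocity = velocity + rng.normal(0.0, velocity_noise_std, size=3)
    new_position = position + dt * new_velocity
    return new_position, new_velocity

rng = np.random.default_rng(0)
# With zero noise the object simply moves along its velocity for one frame at 30 Hz.
p, v = transition(np.zeros(3), np.array([0.3, 0.0, 0.0]), 1.0 / 30.0, 0.0, rng)
```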
There are two problems which preclude the application of a standard GF to this model. First, the standard GF does not work with fat-tailed observation models. Second, the standard GF is too computationally expensive for this high-dimensional problem. In the following we address these two issues.
In the remainder of the paper, we only consider a single update step. For ease of notation, we will not explicitly write the dependence on all previous observations anymore; it is however implicitly present in all distributions. All the remaining variables have the same time index $t$, which we can thus safely drop. For example, $p(x_t \mid y_{1:t-1})$ becomes $p(x)$ and $p(y_t \mid x_t)$ becomes $p(y \mid x)$, etc.
It is important to keep in mind that the parameters computed in the following sections are also time-varying; all computations are carried out at each time step.
The GF depends on the observation model only through its mean and covariance; it is incapable of capturing more subtle features. This means that the observation model (14) is treated exactly as a Gaussian observation model with the same mean and covariance. The covariance of (14) is very large due to the uniform tail, so the standard GF barely reacts to incoming measurements.
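To see how strongly the uniform tail inflates the covariance that the standard GF works with, one can compute the first two moments of the mixture in closed form. The numbers below (5 mm body noise, tail weight 0.1, object at 1 m) are assumed for illustration only.

```python
def mixture_moments(depth, sigma, tail_weight, d_min=0.5, d_max=7.0):
    """Mean and variance of a Gaussian body mixed with a uniform tail."""
    uniform_mean = 0.5 * (d_min + d_max)
    uniform_var = (d_max - d_min) ** 2 / 12.0
    mean = (1.0 - tail_weight) * depth + tail_weight * uniform_mean
    second_moment = (1.0 - tail_weight) * (sigma**2 + depth**2) + tail_weight * (
        uniform_var + uniform_mean**2
    )
    return mean, second_moment - mean**2

mean, var = mixture_moments(depth=1.0, sigma=0.005, tail_weight=0.1)
std_ratio = var**0.5 / 0.005   # mixture standard deviation relative to the body
```

With these values the mixture has a standard deviation of about one meter, roughly two hundred times that of the Gaussian body, which is why a filter that only sees the mean and covariance barely trusts any pixel.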
In  we show that this problem can be addressed by replacing the actual measurement by a virtual measurement
with the time-varying parameters
Since the different pixels are treated as independent sensors, this feature is computed for each pixel independently.
The resulting algorithm, called the Robust Gaussian Filter (RGF), corresponds to the standard GF where the observation models are replaced by virtual observation models, and the measurements from the sensor are replaced by the corresponding features. A step-by-step description of this procedure is provided in Section VI.
As explained in Section II-D, naive application of a GF to this problem would lead to a computational complexity of $O(N^3 + M^3)$, with $N$ being the dimension of the state and $M$ the number of pixels of the depth image. Because $M$ is large, this complexity is prohibitive for real-time applications.
Since the observation model (14) factorizes in the measurements, the GF could be applied sequentially, incorporating the sensors one by one. We would carry out $M$ updates with a complexity of $O(N^3)$ each, leading to a total complexity of $O(MN^3)$. Unfortunately, this implementation can still be too slow for real-time usage. Furthermore, it cannot be parallelized because each update depends on the previous one.
In the following we derive an update for the GF with uncorrelated measurements which has a complexity of $O(MN^2 + N^3)$, where $N$ is the dimension of the state and $M$ the number of pixels. Furthermore, the computations are independent for each sensor, making the algorithm eligible for parallelization.
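A back-of-the-envelope count gives a feel for the three variants. The values are assumptions for illustration: a 12-dimensional state (pose deviations plus linear and angular velocity) and 3072 pixels as in the experimental setup; constants are ignored, and the per-pixel cost of the factorized update is assumed to be quadratic in the state dimension.

```python
N = 12      # assumed state dimension
M = 3072    # pixels after downsampling

naive = N**3 + M**3             # one M x M inversion dominates
sequential = M * N**3           # one N x N matrix square root per pixel
factorized = N**3 + M * N**2    # one square root, then quadratic work per pixel

print(naive, sequential, factorized)
# prints: 28991030976 5308416 444096
```

Even the sequential variant is several thousand times cheaper than the naive update in this count, and the factorized variant saves roughly another factor of `N`.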
The GF never explicitly computes an approximate observation model $q(y \mid x)$. Nevertheless, such a model is implied by the joint approximation $q(x, y)$. For the subsequent derivation, it is necessary to make $q(y \mid x)$ explicit. The objective function (7) can be written as
$$\mathrm{KL}\left[ p(x, y) \,\|\, q(x, y) \right] = \mathrm{KL}\left[ p(x) \,\|\, q(x) \right] + \mathbb{E}_{p(x)}\left[ \mathrm{KL}\left[ p(y \mid x) \,\|\, q(y \mid x) \right] \right]. \tag{22}$$
The joint distribution $q(x, y)$ is assumed to be Gaussian in the GF. This implies that $q(x)$ is Gaussian, and that $q(y \mid x)$ is Gaussian as well, with the mean having a linear dependence on $x$. The parameters of $q(x)$ and $q(y \mid x)$ are independent, since any two such distributions will form a valid joint distribution $q(x, y) = q(y \mid x)\, q(x)$. Therefore, the two terms in (22) can be optimized independently.
Since the predicted belief $p(x)$ is Gaussian, we can fit it perfectly with $q(x) = p(x)$, resulting in $\mathrm{KL}\left[ p(x) \,\|\, q(x) \right] = 0$.
The right-most term is the expected KL-divergence between the exact observation model $p(y \mid x)$ and the approximation $q(y \mid x)$. This makes intuitive sense: the observation model only needs to be approximated accurately in regions of the state space which are likely to be visited. Please note that so far we have merely taken a different perspective on the standard GF; no changes have been made yet in this section.
Interestingly, the fact that the exact observation model (14) factorizes in the measurement does not imply that the optimal approximation $q(y \mid x)$ will do so as well. The only change we apply to the standard GF in this section is to impose the factorization of the exact distribution $p(y \mid x) = \prod_i p(y_i \mid x)$ on the approximate distribution, $q(y \mid x) = \prod_i q(y_i \mid x)$.
Inserting the factorized distributions into (22), and making use of $\mathrm{KL}\left[ p(x) \,\|\, q(x) \right] = 0$, we obtain
$$\mathrm{KL}\left[ p(x, y) \,\|\, q(x, y) \right] = \sum_{i} \mathbb{E}_{p(x)}\left[ \mathrm{KL}\left[ p(y_i \mid x) \,\|\, q(y_i \mid x) \right] \right].$$
The objective function is now a sum over the expected KL-divergences between the approximate and exact observation models for each sensor. Each of these summands can be optimized independently. The computational complexity is reduced and parallelization becomes possible.
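The joint distribution $q(x, y_i)$ being Gaussian implies that the conditionals have the form
$$q(y_i \mid x) = \mathcal{N}\left( y_i \mid \mu_{y_i} + A_i (x - \mu_x),\; s_i \right), \tag{24}$$
with $A_i = \Sigma_{y_i x} \Sigma_{xx}^{-1}$ and $s_i = \Sigma_{y_i y_i} - A_i \Sigma_{x y_i}$, where $\mu_{y_i}$ and $\Sigma_{y_i y_i}$ are the mean and variance of sensor $i$, and $\Sigma_{x y_i} = \Sigma_{y_i x}^\top$ is its cross-covariance with the state.
We have approximated the observation models for each sensor by a linear Gaussian distribution (24). No further approximations are required: finding the approximate posterior merely requires some standard manipulations of Gaussian distributions. The complete observation model is the product of all the individual observation models
$$q(y \mid x) = \prod_{i} q(y_i \mid x).$$
Having $q(x) = \mathcal{N}(x \mid \mu_x, \Sigma_{xx})$ and $q(y \mid x)$ we can now apply Bayes rule to express $q(x \mid y)$. This operation is simple for Gaussian distributions, and we obtain
$$q(x \mid y) = \mathcal{N}\left( x \mid \mu_{x|y}, \Sigma_{x|y} \right)$$
with the parameters
$$\Sigma_{x|y}^{-1} = \Sigma_{xx}^{-1} + \sum_{i} A_i^\top s_i^{-1} A_i, \qquad \mu_{x|y} = \mu_x + \Sigma_{x|y} \sum_{i} A_i^\top s_i^{-1} (y_i - \mu_{y_i}).$$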
Note that for both the precision matrix and the mean of the posterior, the contributions of each sensor are simply summed up. Unlike for the sequential GF, these contributions can be computed independently for each sensor, which allows for parallelization.
The prediction step of the proposed method is identical to the one in the standard GF and will therefore not be discussed. A summary of the update step is presented in Algorithm 1. The key advantage of the proposed method is that the computation of the matrix square root of the predicted covariance matrix has to be performed only once. If we were to update sequentially in the pixels, we would have to compute the square root of the updated covariance matrix each time after incorporating a pixel. The proposed method, in contrast, generates state samples only once, outside of the loop over the pixels, see Algorithm 1.
The remaining computations inside of the loop are of computational complexity $O(N^2)$ in the state dimension $N$. For $M$ pixels, this leads to an improved computational complexity of $O(MN^2 + N^3)$ over the sequential version, which requires $O(MN^3)$.
Furthermore, the computations inside of the for-loop of Algorithm 1 are independent, allowing for parallelization, which is not possible with the sequential GF. While we did not exploit this possibility in the experiments in this paper, it might become important as sensors’ resolutions and rates keep increasing.
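The structure of such a factorized update can be sketched as follows. This is not Algorithm 1 itself: it leaves out the robustification, assumes additive Gaussian pixel noise with variance `noise_var`, and all function and variable names are illustrative. It only shows that the state samples are generated once and that each pixel contributes an independent term to the posterior precision and mean.

```python
import numpy as np

def sigma_points(mean, cov, kappa=1.0):
    """Basic unscented-transform samples; the matrix square root is computed once."""
    n = mean.shape[0]
    root = np.linalg.cholesky((n + kappa) * cov)
    pts = [mean] + [mean + s * root[:, j] for j in range(n) for s in (1.0, -1.0)]
    w = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    return np.array(pts), w

def factorized_update(mu_x, cov_xx, predict_pixels, noise_var, y):
    """Fit one linear-Gaussian model per pixel and sum the contributions."""
    points, weights = sigma_points(mu_x, cov_xx)          # outside the pixel loop
    prior_precision = np.linalg.inv(cov_xx)
    expected = np.array([predict_pixels(p) for p in points])  # shape (samples, M)
    mu_y = weights @ expected
    dx, dy = points - mu_x, expected - mu_y
    precision = prior_precision.copy()
    info = np.zeros_like(mu_x)
    for i in range(y.shape[0]):                 # iterations are independent
        cov_xyi = (weights * dy[:, i]) @ dx     # cross-covariance with pixel i
        var_yi = weights @ dy[:, i] ** 2 + noise_var
        a_i = prior_precision @ cov_xyi         # linear coefficient of pixel i
        s_i = var_yi - a_i @ cov_xyi            # conditional variance of pixel i
        precision += np.outer(a_i, a_i) / s_i
        info += a_i * (y[i] - mu_y[i]) / s_i
    post_cov = np.linalg.inv(precision)
    return mu_x + post_cov @ info, post_cov

# Hypothetical linear toy problem with three "pixels" and a 2D state.
mu = np.array([0.0, 1.0])
P = np.array([[1.0, 0.2], [0.2, 0.5]])
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0.3, 0.8, 1.4])
m_f, P_f = factorized_update(mu, P, lambda x: H @ x, 0.01, y)
```

For this linear toy problem with independent noise, the exact observation model already factorizes, so the result coincides with the standard Kalman update. For nonlinear models the two updates generally differ, since the factorization is imposed on the approximation.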
We compare the proposed method to a PF-based object tracker. Before we present the results, we describe the experimental setup.
Due to the lack of datasets for 3D object tracking using depth cameras, we created a dataset which is publicly available. It consists of 37 recorded sequences of depth images, acquired by an Asus Xtion sensor. Each of these sequences is about two minutes long and shows one out of six objects being moved with different velocities. To evaluate the tracking performance at different velocities, the dataset contains sequences with three velocity categories as shown in Table I. Additionally, the dataset contains sequences with and without partial occlusion. The camera-object distance ranges between 0.8 m and 1.1 m.
Table I: Velocity categories of the dataset.
| | Velocity 1 | Velocity 2 | Velocity 3 |
|---|---|---|---|
| Translational | 5 cm/s | 11 cm/s | 21 cm/s |
| Rotational | 10 deg/s | 25 deg/s | 50 deg/s |
The following parameters were chosen empirically for a processing rate of 30 Hz. The process model has a mm standard deviation in translational noise and rad standard deviation in rotational noise. The observation noise was set to mm standard deviation. The tail probability was set to . The RGF is not sensitive to the tail probability as discussed in .
To achieve real-time performance (30 Hz) on a single CPU core, we used a downsampling factor of 10 for the depth image with a VGA resolution of $640 \times 480$ pixels. This results in a 3072 dimensional observation. For numeric integration we use the canonical Unscented Transform with the parameters , , and . The implementations of both the proposed method and the PF-based method are available in the open-source software package Bayesian Object Tracking.
In the following experiments we show the importance of robustifying the GF, an example of tracking under heavy occlusion, and a quantitative evaluation of the tracking performance on the full dataset of the proposed method compared to a PF-based tracker. A summary video illustrating the tracking performance is available at https://www.youtube.com/watch?v=mh9EIz-Tdxg.
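The observation dimension follows directly from the image size and the downsampling factor. The snippet below checks the arithmetic on a placeholder image; it assumes that downsampling keeps every tenth pixel in both directions, which may differ from the scheme used in the actual implementation.

```python
import numpy as np

FACTOR = 10                                       # downsampling factor from the text
depth = np.zeros((480, 640), dtype=np.float32)    # placeholder VGA depth image (rows, cols)

# Assumption: keep every FACTOR-th pixel along both image axes.
small = depth[::FACTOR, ::FACTOR]
observation = small.reshape(-1)                   # one entry per retained pixel

print(small.shape, observation.shape)
# prints: (48, 64) (3072,)
```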
We anticipated in Sections III and IV that the standard GF does not work with heavy-tailed measurements. In contrast, the RGF can handle this kind of measurement noise, and therefore allows us to take a GF-based approach to filtering using depth data. To illustrate this empirically, we compare the proposed method to a standard GF. The observation model in the proposed method is given by (14). For the standard GF we choose the same observation model without its tail. In practice this is achieved by executing exactly the same method with a tail weight of zero.
We use as input data a recording of the impact wrench (Figure 1) standing still on the table. After a couple of seconds, a partial occlusion occurs.
We run both filters and calculate translational error and angular error between the estimated poses and the ground-truth poses. In Figure 3 we show these errors over time.
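One common way to compute such pose errors is sketched below. The text does not spell out the exact definitions, so both functions are assumptions: the translational error is taken as the Euclidean distance between positions, and the angular error as the angle of the relative rotation between estimated and ground-truth orientation.

```python
import numpy as np

def translational_error(t_est, t_true):
    """Euclidean distance between estimated and ground-truth position."""
    return float(np.linalg.norm(np.asarray(t_est, dtype=float) - np.asarray(t_true, dtype=float)))

def angular_error_deg(R_est, R_true):
    """Angle of the relative rotation R_est^T R_true, in degrees."""
    R_rel = np.asarray(R_est).T @ np.asarray(R_true)
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

# Example: ground truth rotated by 10 degrees about the z-axis.
a = np.radians(10.0)
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a),  np.cos(a), 0.0],
               [0.0,        0.0,       1.0]])
err_deg = angular_error_deg(np.eye(3), Rz)
err_pos = translational_error([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
```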
Before the occlusion occurs, both filters have very similar performance. Then, in the presence of outliers, the standard GF reacts strongly and its error increases abruptly, while the proposed method remains steady and its error low for the whole sequence.
Here we compare the proposed GF-based method to a PF-based tracker. We run both on a sequence showing an impact wrench standing on a table, while being partially occluded. After a couple of seconds, there is a total occlusion lasting several seconds.
Again, we calculate translational and angular error with respect to the ground-truth annotations. Figure 4 shows tracking error over time.
Both approaches perform well in the presence of partial occlusion, in the beginning of the sequence. However, the zoom on the right-hand side of the figure reveals a key advantage of GF-based methods over PF-based methods: The RGF is much more accurate and more steady. This is particularly important when this estimate is to be used for controlling a robot.
When the object becomes entirely occluded, the PF starts drifting randomly due to its stochastic nature. The RGF, however, reacts in a more deterministic manner: when only outliers are received, these are soft-rejected and the estimate simply reverts to the prediction, and therefore remains steady.
For this experiment, we run the proposed method and the PF on the 37 sequences of the dataset. In the case of the PF, we aggregate the results of 30 independent runs. This is not necessary for the robust GF since the implementation is based on the UT and is therefore deterministic.
Both trackers are reset to the ground truth every 5 seconds to avoid comparisons between the two while one has lost track.
At each time step, we evaluate the translational and the angular error as in previous experiments. The results are summarized in Figure 5, which shows the distribution of the errors for both methods. We can see that the RGF yields more accurate estimates. However, the proposed method loses track a little more often than the PF-based method. This only happens very rarely and is therefore not visible in the density in Figure 5. Nevertheless, these rare large errors lead to a higher mean error, see Table II. However, the median error of the proposed method is smaller than the one of the PF-based method.
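The effect of rare track losses on the two summary statistics can be reproduced with a few synthetic numbers. The error values below are made up purely for illustration and are unrelated to the measured results.

```python
import statistics

# Tracker A: very accurate, but loses track once (synthetic values).
errors_a = [1.0] * 99 + [200.0]
# Tracker B: slightly less accurate, never loses track (synthetic values).
errors_b = [2.0] * 100

mean_a, median_a = statistics.mean(errors_a), statistics.median(errors_a)
mean_b, median_b = statistics.mean(errors_b), statistics.median(errors_b)
```

A single large error is enough to push the mean of tracker A above that of tracker B, while its median stays lower. This is the same pattern as the one described for the two filters.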
Summarizing, we can say that the PF-based method is a little more robust, but that the proposed GF-based method is more accurate and yields smoother estimates. The properties of the two filters are complementary. For tracking of fast motion the PF-based method might be preferable, but for accurate visual servoing the proposed method is more suitable.
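Table II: Tracking errors on the full dataset.
| | RGF (proposed) | PF |
|---|---|---|
| Translational error mean [mm] | 0.03008 | 0.00956 |
| Translational error median [mm] | 0.00489 | 0.00556 |
| Angular error mean [deg] | 0.39644 | 0.29044 |
| Angular error median [deg] | 0.06076 | 0.07146 |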
In this paper we apply a Gaussian Filter to the problem of 3D object tracking from depth images. There are two major obstacles to applying the GF to this problem. First of all, the GF is inherently non-robust to outliers, which are common in depth data. We address this problem by applying a recently proposed robustification method.
Secondly, the complexity of the standard GF is prohibitive for the high-dimensional measurements obtained from a depth camera. We propose a novel update which has a complexity of $O(MN^2 + N^3)$, where $M$ is the number of sensors and $N$ is the dimension of the state space. Furthermore, the proposed update is eligible for parallelization since the information coming from each pixel is processed independently.
The experimental results illustrate the advantage of GF-based methods over PF-based methods. They often provide smoother and more accurate estimates. These are very important properties if the estimates are used to control a robot.
Bayesian Reasoning and Machine Learning. Cambridge University Press, 2012.
Asian Conference on Computer Vision, 2012.
Gaussian filters for nonlinear filtering problems. IEEE Transactions on Automatic Control, 2000.