Upper body joint tracking for the RWTH-Phoenix signer database using a Mixture Kalman filter
Traditional approaches to upper body pose estimation using monocular vision rely on complex body models and a large variety of geometric constraints. We argue that this is not ideal and somewhat inelegant as it results in large processing burdens, and instead attempt to incorporate these constraints through priors obtained directly from training data. A prior distribution covering the probability of a human pose occurring is used to incorporate likely human poses. This distribution is obtained offline, by fitting a Gaussian mixture model to a large dataset of recorded human body poses, tracked using a Kinect sensor. We combine this prior information with a random walk transition model to obtain an upper body model, suitable for use within a recursive Bayesian filtering framework. Our model can be viewed as a mixture of discrete Ornstein-Uhlenbeck processes, in that states behave as random walks, but drift towards a set of typically observed poses. This model is combined with measurements of the human head and hand positions, using recursive Bayesian estimation to incorporate temporal information. Measurements are obtained using face detection and a simple skin colour hand detector, trained using the detected face. The suggested model is designed with analytical tractability in mind and we show that the pose tracking can be Rao-Blackwellised using the mixture Kalman filter, allowing for computational efficiency while still incorporating bio-mechanical properties of the upper body. In addition, the use of the proposed upper body model allows reliable three-dimensional pose estimates to be obtained indirectly for a number of joints that are often difficult to detect using traditional object recognition strategies. Comparisons with Kinect sensor results and the state of the art in 2D pose estimation highlight the efficacy of the proposed approach.
Reliable human pose estimation is a frequently encountered computer vision task, often required for successful vision-based gesture or action recognition systems. Specifically, our goal is to perform gesture recognition for human-robot interaction, which requires the 3D positions of human upper bodies to be tracked. Unfortunately, this is a particularly challenging problem, especially in cluttered environments with potentially moving cameras.
2D information typically suffices if only static gestures are to be recognised, but 3D information is required for most temporal gesture recognition solutions [Wu and Huang, 1999]. Multiple camera motion capture systems can provide 3D measurements with a high level of accuracy, but often require that users wear markers that aid in detection. Stereo camera vision allows for relatively accurate 3D spatial information to be obtained and as a result is commonly used for temporal gesture recognition. This is evidenced by the gesture recognition schemes of Triesch and Von Der Malsburg, Lee, and Nickel and Stiefelhagen, which all use stereo vision systems to observe gestures. 3D information can also be obtained using structured light systems such as the Kinect or PrimeSense depth sensor. Unfortunately, while the Xbox Kinect skeleton tracker [Shotton et al., 2011] is extremely effective, carrying a depth sensor is infeasible in many applications where payloads are limited, and a body tracking solution relying only on monocular vision would be preferred.
This paper aims to solve the 3D upper body pose estimation problem using images obtained by only a single camera. We propose a novel upper body model, trained using Kinect pose priors and designed with analytical tractability in mind. We show that pose tracking using this model can be Rao-Blackwellised using the mixture Kalman filter, allowing for computational efficiency while still incorporating bio-mechanical properties of the upper body. The model is used within a recursive Bayesian framework to provide reliable estimates of user head, neck, shoulder, elbow and hand locations when only a subset of body joints can be detected.
Face detection is used to determine head position, and provides a skin colour prior that assists in locating hands. Edge-based error correction is proposed to correct potential hand association errors before head and hand measurements are used to estimate upper body pose.
The paper is organised as follows. Section 2 discusses related work and provides some background to the problem. This is followed by a description of Bayesian filtering for human pose estimation, the introduction of our body models and various tracking algorithms that can be used with these in section 3. A comparison of the trackers is provided in section 4, before we describe how we obtain head and hand measurements from images in section 5. Results obtained when these measurements are used in conjunction with our model and body tracker are presented in section 6, along with a comparison with a recent 2D pose estimation approach [Eichner et al., 2012]. Finally, conclusions are provided in section 7.
Effective human pose estimation is required for successful vision-based gesture recognition systems to be deployed. This section describes various approaches to human pose estimation, within the context of gesture recognition.
A vast amount of work has been conducted in the field of human pose estimation using monocular vision. Two approaches to pose estimation from static images have emerged, the first relying on tracking and generative models, and the second on morphological recognition. Morphological recognition techniques can be top-down, where entire bodies are recognised, or bottom-up, where bodies are recognised by locating various body parts or components. Gavrila and Davis  use a top-down, search-based technique to locate poses by matching contours or edges formed using a generative body model with those in an input image.
A number of top-down approaches rely on matching extracted silhouettes to a known database. This technique is applied by Germann et al.  who refine matched pose estimates using a set of 3D body part constraints. This approach relies on multiple cameras though, and the extraction of silhouettes, which can be challenging. Further, the authors note that additional information is required to estimate poses where the arms are close to the body, as silhouettes do not contain sufficient information to do so.
The dominant approach to pose estimation is bottom-up [Yang and Ramanan, 2011], using a pictorial structure of body parts with geometric constraints modelling component interactions. Yang and Ramanan [2011] use a family of affinely warped templates and a mixture model capturing contextual relations, and produce good pose estimation results at approximately 1 frame per second. A pictorial structure model is also used by Eichner et al. [2012], who detect bodies using a part-based model, segment these bodies using Grabcut [Rother et al., 2004] and then fit appearance models trained previously using labelled data. This approach also provides good performance, but can be slow, and only works on near frontal and rear viewpoints.
A number of pose estimation techniques use segmentation to locate and extract human bodies. In their work on pose estimation for sign-language videos, Charles et al. leverage the layering of signers on video to extract bodies using co-segmentation, before estimating joint locations using a selection of random forests trained on a number of previously segmented bodies (labelled using the work of Buehler et al.). Unfortunately, accurate segmentation is slow on general video sequences and not usually feasible for real time applications.
Bottom-up approaches to pose estimation are also used in tracking-based pose estimation approaches. Lee and Cohen used a 21 degree-of-freedom generative model of human kinematics, shape and clothing in a data-driven Markov chain Monte Carlo search. Here, visual cues of the face, head and shoulder contours, skin blobs and arm ridges were used to aid importance sampling and drive a Monte Carlo search to feasible 3D pose candidates. Unfortunately, estimating 3D pose from static 2D images results in a number of pose ambiguities, as different body configurations can appear similar when viewed from different points.
Many pose estimation techniques rely on Monte Carlo simulation or particle filtering. Particle filters represent the posterior belief in a state, conditioned on a set of measurements, by a set of random state samples drawn from this distribution. While the particle filter is able to approximate non-Gaussian noise distributions extremely well, it is computationally intensive as motion and observation models need to operate on multiple particles. Moreover, the memory requirements of particle filter algorithms are excessive, as the performance of the algorithm is dependent on the number of particles used.
In high dimensional state spaces, the effective number of particles required to approximate the posterior belief can become extremely large and the particle filter tends to operate as a traditional optimisation problem when a feasible number of particles is used. In these cases, additional information is often required to constrain the search space and produce good particle estimates.
Sminchisescu and Triggs  note that many particle filtering algorithms for 3D pose estimation often require the addition of extra noise to assist in the search for minima. They attempt to resolve this by using a complex body model and through careful design of the observation likelihood function, incorporating priors on the anthropometric data of internal proportions, parameter stabilisers, joint limits, and body part penetration avoidance. They also apply covariance scaled sampling to direct the search, which involves combining assumed dynamics with the posterior distribution and growing the prior covariances to sample more broadly. This search can be sped up through the addition of kinematic reasoning to assist in the sampling, reducing the number of possible solutions to a pose if the lengths of limbs are known [Sminchisescu and Triggs, 2003].
Jauregui et al.  also apply kinematic reasoning to aid in pose estimation, but use a silhouette-based observation model. Here, silhouettes are extracted using background subtraction, faces detected and a skin colour model learned. A clothing colour model is also learned, using an image patch directly below the face. These colours are then used when projecting a generative 3D body model, which is compared to the thresholded body.
Deutscher et al.  have proposed the use of simulated annealing to solve the high dimensional search problem associated with 3D pose estimation. Here, a set of weighting functions are used to drive the particle filter search to possible solutions. Davison et al.  perform 3D tracking using multiple cameras and a simulated annealing search. In this case, generative body models are used to create edge and foreground templates, which are compared to those observed using a sum of squared distances metric.
The difficulties in 3D pose estimation from 2D images have led some researchers to focus on 2D pose estimation in images, a slightly better posed problem. Hua et al.  apply Markov chain Monte Carlo estimation to fit a set of 2D quadrangles to humans in images, using an observation model combining colour measurements of the head and hands (learned after face detection), and line segments extracted from the torso.
Applying Monte Carlo search techniques to pose estimation has the benefit of allowing a number of constraints and priors to be incorporated. However, the large number of constraints and complex models required to direct the high dimensional search is hardly ideal, and somewhat inelegant, resulting in large processing burdens. The incorporation of these constraints through priors obtained directly from training data is proposed here, in an attempt to simplify the sampling stages.
The process of learning constraints from training data has been advocated by Yu et al. , who clustered 3D body positions according to various action categories, then used action recognition and 2D body parts detected using a deformable part model to predict 3D pose with a random forest. The use of action recognition restricts the possible pose search space, allowing for faster and more accurate pose estimation.
Howe et al.  used 3D motion capture data to train a Gaussian mixture model prior, which when combined with a Gaussian error model of 2D tracked body parts allows a 3D pose estimate to be computed using Expectation maximisation. Our approach is similar to this as it also uses a Gaussian mixture prior to incorporate body constraints, but differs through the inclusion of temporal motion tracking using recursive Bayesian estimation. In addition, only a subset of body parts need to be detected for body tracking. A description of our pose estimation method follows.
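Assuming the human body can be modelled as an unobserved Markov process with a set of joint states $\mathbf{x}_t$ at time $t$, recursive Bayesian estimation allows states to be updated as measurements $\mathbf{z}_t$ are made,

$$p(\mathbf{x}_t \mid \mathbf{z}_{1:t}) = \eta\, p(\mathbf{z}_t \mid \mathbf{x}_t) \int p(\mathbf{x}_t \mid \mathbf{x}_{t-1})\, p(\mathbf{x}_{t-1} \mid \mathbf{z}_{1:t-1})\, \mathrm{d}\mathbf{x}_{t-1}. \tag{1}$$

Here, $\eta$ is a normalising constant and the subscript nomenclature $1\!:\!t$ refers to the collection of a quantity (states or measurements) from time step 1 to $t$. This process allows for continual state estimation that includes temporal information, using a transition model to predict state changes and an observation model to introduce measurement information.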
For human body tracking, the state vector could comprise the 3D positions of all joints of interest, together with the camera position and orientation, but this causes a number of estimation difficulties when only 2D image measurements from a single camera are available. In this case, image measurements are a non-linear function of the camera position and orientation, which complicates the tracking problem significantly.
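This complication can be avoided by performing all filtering in the image plane and only returning to 3D coordinates when a state estimate is obtained. Let $u$ and $v$ be image coordinates of a body joint, $[X, Y, Z]^T$, observed by a camera with 6 degree-of-freedom pose $[\mathbf{R} \mid \mathbf{t}]$,

$$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{K} \left[\mathbf{R} \mid \mathbf{t}\right] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = \mathbf{P} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}. \tag{3}$$

Here, $\mathbf{K}$ denotes an intrinsic camera calibration matrix,

$$\mathbf{K} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},$$

with $f_x$ and $f_y$ focal distances and $(c_x, c_y)$ coordinates of the camera’s principal point.
Selecting a state vector comprising the scale parameter $\lambda$, image plane coordinates $(u, v)$, and camera pose $[\mathbf{R} \mid \mathbf{t}]$ allows us to make direct comparisons between state and measurements. Once a state estimate is made, returning to 3D coordinates is trivial, with

$$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \left[\mathbf{p}_1\ \mathbf{p}_2\ \mathbf{p}_3\right]^{-1} \left( \lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} - \mathbf{p}_4 \right), \tag{5}$$

and $\mathbf{p}_i$ denoting the $i$-th column vector of the projection matrix $\mathbf{P}$ in (3).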
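The round trip between 3D joint positions and the image-plane state can be illustrated with a short NumPy sketch. It assumes a standard pinhole model in which the scale parameter is the depth of the joint along the optical axis; the calibration values, camera pose and function names below are hypothetical and chosen only for illustration.

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D point X to image coordinates; returns (u, v, scale)."""
    P = K @ np.hstack([R, t.reshape(3, 1)])   # 3x4 projection matrix
    h = P @ np.append(X, 1.0)                 # equals scale * [u, v, 1]
    scale = h[2]
    return h[0] / scale, h[1] / scale, scale

def back_project(K, R, t, u, v, scale):
    """Recover the 3D point from image coordinates and the scale parameter."""
    P = K @ np.hstack([R, t.reshape(3, 1)])
    rhs = scale * np.array([u, v, 1.0]) - P[:, 3]
    return np.linalg.solve(P[:, :3], rhs)     # uses the first three columns of P

# Hypothetical calibration and camera pose (illustrative values only)
K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])
X = np.array([0.3, -0.2, 0.5])

u, v, scale = project(K, R, t, X)
X_rec = back_project(K, R, t, u, v, scale)
```

Because the scale is carried in the state, the inversion is a single linear solve, which is why filtering can stay in the image plane until a 3D estimate is needed.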
We construct a transition model by combining a simple motion model, $p_m(\mathbf{x}_t \mid \mathbf{x}_{t-1})$, with an objective function or prior, $p_g(\mathbf{x}_t)$:

$$p(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \frac{p_m(\mathbf{x}_t \mid \mathbf{x}_{t-1})\, p_g(\mathbf{x}_t)}{\int p_m(\mathbf{x}_t \mid \mathbf{x}_{t-1})\, p_g(\mathbf{x}_t)\, \mathrm{d}\mathbf{x}_t}.$$

This decomposition is useful as it allows a prior distribution covering the probability of a human pose occurring to be used to incorporate likely human poses into the motion model. This distribution is obtained offline, by fitting a Gaussian mixture model (GMM) to a large dataset of recorded human body poses. The positions of upper body joints of interest are tracked using a Kinect sensor [Shotton et al., 2011]. Recorded 3D joint positions are then projected into 2D, assuming a pinhole camera with a known camera calibration matrix, $\mathbf{K}$, and a random set of camera viewpoints within a set of constraints (, and ; and translation m). This provides a much larger set of recorded 2D joint positions. Figure 1 shows the original 3D recorded pose data, and the corresponding 2D pose data generated through the synthetic viewpoints is shown in Figure 2.
This large dataset is infeasible to work with directly, and so the Gaussian mixture model of this distribution is a useful form of dimension reduction. A more detailed description of GMMs and their training is provided in Appendix A. The Gaussian mixture model is denoted by $p_g(\mathbf{x}_t)$, the probability of an upper body pose occurring,

$$p_g(\mathbf{x}_t) = \sum_{k=1}^{M} \pi_k\, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),$$

with $\pi_k$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ the weight, mean and covariance of the $k$-th of $M$ mixture components.
Learning the GMM can be computationally intensive and a large number of mixture components may be required. This is remedied by assuming independent left and right arms, and training two mixture models instead.
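The independence assumption means the pose prior factorises, so its log density is simply the sum of two smaller mixture log densities. The sketch below only evaluates such a prior (it does not fit one); the function names and the tuple layout `(weights, means, covariances)` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def gmm_logpdf(x, weights, means, covs):
    """Log density of a Gaussian mixture evaluated at a single point x."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    terms = []
    for w, m, S in zip(weights, means, covs):
        diff = x - np.asarray(m, dtype=float)
        _, logdet = np.linalg.slogdet(S)
        maha = diff @ np.linalg.solve(S, diff)          # squared Mahalanobis distance
        terms.append(np.log(w) - 0.5 * (d * np.log(2.0 * np.pi) + logdet + maha))
    return np.logaddexp.reduce(terms)                   # stable log-sum-exp over components

def pose_logprior(x_left, x_right, gmm_left, gmm_right):
    """Log prior under the assumption of independent left and right arms."""
    return gmm_logpdf(x_left, *gmm_left) + gmm_logpdf(x_right, *gmm_right)

# Toy single-component mixtures in 2D (illustrative only)
unit = ([1.0], [np.zeros(2)], [np.eye(2)])
lp = pose_logprior(np.zeros(2), np.zeros(2), unit, unit)
```

Training two mixtures over half of the joints each keeps the dimensionality of every component lower than a single mixture over the full pose would require.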
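It is unlikely that states will vary much between time steps, and so we use a random walk to describe the motion between states: $\mathbf{x}_t = \mathbf{x}_{t-1} + \mathbf{w}_t$, with $\mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_w)$, or

$$\mathbf{x}_t \mid \mathbf{x}_{t-1} \sim \mathcal{N}(\mathbf{x}_{t-1}, \boldsymbol{\Sigma}_w). \tag{8}$$

The covariance matrix $\boldsymbol{\Sigma}_w$ in (8) is assumed to be a diagonal matrix with each diagonal term selected empirically with image dimensions in mind.
The prior learned from the training data inherently contains kinematic constraints, as well as information on more commonly observed poses. It is also extremely compact and simple. Using this prior, and the fact that the product of two multivariate normal densities over random variable $\mathbf{x}_t$ is another multivariate normal and scaling constant, we can write, for the $k$-th mixture component with weight $\pi_k$, mean $\boldsymbol{\mu}_k$ and covariance $\boldsymbol{\Sigma}_k$,

$$\mathcal{N}(\mathbf{x}_t; \mathbf{x}_{t-1}, \boldsymbol{\Sigma}_w)\, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_w + \boldsymbol{\Sigma}_k)\, \mathcal{N}(\mathbf{x}_t; \mathbf{c}_k, \mathbf{C}_k),$$

$$\mathbf{C}_k = \left(\boldsymbol{\Sigma}_w^{-1} + \boldsymbol{\Sigma}_k^{-1}\right)^{-1}, \qquad \mathbf{c}_k = \mathbf{C}_k \left(\boldsymbol{\Sigma}_w^{-1} \mathbf{x}_{t-1} + \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k\right).$$

As a result, the evidence can be computed as

$$\int \mathcal{N}(\mathbf{x}_t; \mathbf{x}_{t-1}, \boldsymbol{\Sigma}_w) \sum_{k} \pi_k\, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\, \mathrm{d}\mathbf{x}_t = \sum_{k} \pi_k\, \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_w + \boldsymbol{\Sigma}_k).$$

This provides the final transition model

$$p(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \frac{\sum_{k} \pi_k\, \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_w + \boldsymbol{\Sigma}_k)\, \mathcal{N}(\mathbf{x}_t; \mathbf{c}_k, \mathbf{C}_k)}{\sum_{j} \pi_j\, \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_w + \boldsymbol{\Sigma}_j)}. \tag{14}$$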
This model can be viewed as a mixture of discrete Ornstein-Uhlenbeck processes, in that states behave as random walk, but drift towards a set of typically observed mean poses.
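The drift towards typical poses follows from the standard identity for the product of two Gaussian densities. The sketch below, using hypothetical one-dimensional values, computes the mean and covariance of the product of a random walk density centred on the previous state and a single mixture component, together with the scaling constant; the function and variable names are illustrative assumptions.

```python
import numpy as np

def gauss_pdf(x, mean, cov):
    """Multivariate normal density at x."""
    x, mean, cov = np.atleast_1d(x), np.atleast_1d(mean), np.atleast_2d(cov)
    diff = x - mean
    norm = np.sqrt(np.linalg.det(2.0 * np.pi * cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def component_product(x_prev, mu_k, Sigma_k, Sigma_w):
    """Product N(x; x_prev, Sigma_w) * N(x; mu_k, Sigma_k) = scale * N(x; c, C)."""
    C = np.linalg.inv(np.linalg.inv(Sigma_w) + np.linalg.inv(Sigma_k))
    c = C @ (np.linalg.solve(Sigma_w, x_prev) + np.linalg.solve(Sigma_k, mu_k))
    scale = gauss_pdf(x_prev, mu_k, Sigma_w + Sigma_k)
    return c, C, scale

# Previous state far from the component mean: the predicted mean is pulled back
x_prev = np.array([10.0])
mu_k = np.array([0.0])
Sigma_k = np.array([[9.0]])   # spread of the typical pose cluster
Sigma_w = np.array([[1.0]])   # random walk variance
c, C, scale = component_product(x_prev, mu_k, Sigma_k, Sigma_w)
# c is 9.0 here: one step moves the mean from 10 towards the component mean at 0
```

A small random walk variance relative to the component covariance gives a gentle pull per step, which matches the mean-reverting behaviour of a discrete Ornstein-Uhlenbeck process.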
The observation model used here is assumed to be a Gaussian centred about the difference between a subset of states and measurements,

$$p(\mathbf{z}_t \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{z}_t; \mathbf{H}\mathbf{x}_t, \boldsymbol{\Sigma}_v). \tag{15}$$

The pose state contains the image positions of the head, neck, shoulders, elbows and hands, but it is assumed that only the head, neck and hand states can be measured. These measurements correspond to the subset of states used in the measurement model of (15), selected using the selection matrix $\mathbf{H}$. The covariance matrix $\boldsymbol{\Sigma}_v$ is assumed to be a diagonal matrix with empirically selected diagonal terms, corresponding to a maximum measurement error in pixels, selected with image dimensions in mind.
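An analytical solution to the integral in (1) is not always easily computed and often an approximation is required. One way of performing this is to approximate the target distribution using a discrete set of $N$ samples. Let

$$p(\mathbf{x}_{0:t} \mid \mathbf{z}_{1:t}) \approx \sum_{i=1}^{N} w^i_t\, \delta(\mathbf{x}_{0:t} - \mathbf{x}^i_{0:t}),$$

where the weights $w^i_t$ are chosen using importance sampling. Consider the full posterior distribution over all states and measurements, with initial estimate $\mathbf{x}_0$,

$$p(\mathbf{x}_{0:t} \mid \mathbf{z}_{1:t}) = \frac{p(\mathbf{z}_t \mid \mathbf{x}_{0:t}, \mathbf{z}_{1:t-1})\, p(\mathbf{x}_t \mid \mathbf{x}_{0:t-1}, \mathbf{z}_{1:t-1})\, p(\mathbf{x}_{0:t-1} \mid \mathbf{z}_{1:t-1})}{p(\mathbf{z}_t \mid \mathbf{z}_{1:t-1})}.$$

For a Markov process, the current measurement is only dependent on the current state and the current state is only dependent on the previous state, so we can write

$$p(\mathbf{x}_{0:t} \mid \mathbf{z}_{1:t}) \propto p(\mathbf{z}_t \mid \mathbf{x}_t)\, p(\mathbf{x}_t \mid \mathbf{x}_{t-1})\, p(\mathbf{x}_{0:t-1} \mid \mathbf{z}_{1:t-1}).$$

Constructing an importance density $q(\mathbf{x}_{0:t} \mid \mathbf{z}_{1:t})$ from which state samples are easily sampled provides importance weights

$$w^i_t \propto \frac{p(\mathbf{x}^i_{0:t} \mid \mathbf{z}_{1:t})}{q(\mathbf{x}^i_{0:t} \mid \mathbf{z}_{1:t})},$$

which can be written recursively as

$$w^i_t \propto w^i_{t-1}\, \frac{p(\mathbf{z}_t \mid \mathbf{x}^i_t)\, p(\mathbf{x}^i_t \mid \mathbf{x}^i_{t-1})}{q(\mathbf{x}^i_t \mid \mathbf{x}^i_{0:t-1}, \mathbf{z}_{1:t})}.$$

Since we are only interested in the state at time $t$, and desire an approximation to the density $p(\mathbf{x}_t \mid \mathbf{z}_{1:t})$, we can discard the state history and the weight update equation becomes

$$w^i_t \propto w^i_{t-1}\, \frac{p(\mathbf{z}_t \mid \mathbf{x}^i_t)\, p(\mathbf{x}^i_t \mid \mathbf{x}^i_{t-1})}{q(\mathbf{x}^i_t \mid \mathbf{x}^i_{t-1}, \mathbf{z}_t)}.$$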
Unfortunately, sequential importance sampling often suffers from degeneracy problems [Doucet et al., 2000], where the weights of most particles become negligible after a few iterations. This is remedied by resampling, which generates a new set of particles by sampling with replacement according to the importance weights. This typically eliminates particles that have small weights and adds emphasis to those with larger importance. Special care needs to be taken in the selection of the proposal density. Ideally this should be as close to the target density as possible.
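The sampling importance resampling (SIR) or bootstrap filter, discussed in detail by Ristic et al., is frequently used for recursive Bayesian filtering. Here, the importance density is usually chosen to be equal to the transition density,

$$q(\mathbf{x}_t \mid \mathbf{x}^i_{t-1}, \mathbf{z}_t) = p(\mathbf{x}_t \mid \mathbf{x}^i_{t-1}).$$

This reduces the importance weight calculation to

$$w^i_t \propto w^i_{t-1}\, p(\mathbf{z}_t \mid \mathbf{x}^i_t). \tag{23}$$

By applying resampling at each time step, the weights become uniform, and the weight update simplifies to

$$w^i_t \propto p(\mathbf{z}_t \mid \mathbf{x}^i_t).$$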
The SIR filtering procedure is described in more detail in Algorithm 1.
Resampling may be computationally expensive, so in practice it is not desirable to resample on each iteration. Instead, resampling need only occur when the effective number of particles is below a certain threshold, and the particle filter is close to degeneracy. An estimate of the effective number of particles used by a particle filter [Kong et al., 1994] is

$$\hat{N}_{\mathrm{eff}} = \frac{1}{\sum_{i=1}^{N} \left(w^i_t\right)^2},$$

where the weights $w^i_t$ are normalised to sum to one over the $N$ particles.
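The effective number of particles is commonly estimated as the reciprocal of the sum of squared normalised weights, and resampling is triggered when this falls below a threshold. The sketch below pairs that estimate with systematic resampling; the choice of systematic resampling and the function names are assumptions made for illustration, as the text only specifies sampling with replacement according to the importance weights.

```python
import numpy as np

def effective_sample_size(weights):
    """Estimate of the effective number of particles: 1 / sum(w_i^2)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

def systematic_resample(weights, rng):
    """Return particle indices drawn with replacement according to the weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    n = len(w)
    positions = (rng.random() + np.arange(n)) / n    # one stratified position per particle
    cumulative = np.cumsum(w)
    cumulative[-1] = 1.0                             # guard against round-off
    return np.searchsorted(cumulative, positions, side="right")

def maybe_resample(particles, weights, rng, threshold_ratio=0.5):
    """Resample only when the filter is close to degeneracy."""
    n = len(weights)
    if effective_sample_size(weights) < threshold_ratio * n:
        idx = systematic_resample(weights, rng)
        return particles[idx], np.full(n, 1.0 / n)
    return particles, np.asarray(weights, dtype=float) / np.sum(weights)
```

The threshold ratio of 0.5 is an arbitrary illustrative default; lower values resample less often at the cost of tolerating more weight degeneracy.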
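Unfortunately, drawing samples from the Gaussian mixture model of (14) is rather computationally intensive. Sampling from this GMM requires $N$ draws from a uniform distribution to select a mixture component according to the model’s mixture weights, and a further $N$ draws from different Gaussians (due to the dependence of (14) on previous states) to select particles. As an alternative solution, we propose that samples be drawn from the far simpler random walk density of (8), $\mathcal{N}(\mathbf{x}_t; \mathbf{x}_{t-1}, \boldsymbol{\Sigma}_w)$, which results in the weight update equation of

$$w^i_t \propto w^i_{t-1}\, p(\mathbf{z}_t \mid \mathbf{x}^i_t)\, \frac{\sum_{k} \pi_k\, \mathcal{N}(\mathbf{x}^i_t; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j} \pi_j\, \mathcal{N}(\mathbf{x}^i_{t-1}; \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_w + \boldsymbol{\Sigma}_j)}, \tag{26}$$

where $\pi_k$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ are the weights, means and covariances of the pose prior mixture components, $\boldsymbol{\Sigma}_w$ is the random walk covariance, and the denominator is the scaling (evidence) term of the transition model.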
An even more efficient approximation could neglect the scaling term entirely, although this could potentially introduce evidence bias in the tails of the distribution. In the following section, we will show that ignoring this term is effectively equivalent to modifying the transition model such that a random walk is applied to each mixture component independently, as opposed to the entire distribution.
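A single iteration of this simpler sampling scheme can be sketched as follows. Particles are propagated with the random walk, then weighted by the measurement likelihood and the mixture prior, with the scaling term applied or neglected via a flag. The sketch works in log space for numerical stability and assumes a linear selection matrix `H` for the measured states; all names and the exact form of the scaling term are assumptions based on the product-of-Gaussians evidence, not the authors' code.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log normal density; broadcasts over leading dimensions of x - mean."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    d = diff.shape[-1]
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("...i,ij,...j->...", diff, np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def log_gmm(x, pis, mus, covs, extra_cov=0.0):
    """Log mixture density, optionally with each covariance inflated by extra_cov."""
    terms = [np.log(p) + log_gauss(x, m, S + extra_cov)
             for p, m, S in zip(pis, mus, covs)]
    return np.logaddexp.reduce(terms, axis=0)

def simple_sampling_step(particles, log_w, z, H, Sigma_v, Sigma_w,
                         pis, mus, covs, rng, scaling=True):
    n, d = particles.shape
    # Propose from the random walk rather than the full mixture transition
    new = particles + rng.multivariate_normal(np.zeros(d), Sigma_w, size=n)
    # Weight by measurement likelihood and pose prior at the proposed state
    log_w = log_w + log_gauss(z, new @ H.T, Sigma_v) + log_gmm(new, pis, mus, covs)
    if scaling:
        # Evidence term evaluated at the previous state (assumed form)
        log_w = log_w - log_gmm(particles, pis, mus, covs, extra_cov=Sigma_w)
    log_w = log_w - np.logaddexp.reduce(log_w)       # normalise
    return new, log_w
```

Setting `scaling=False` skips one mixture evaluation per particle, which is the source of the speed-up discussed above.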
Particle filter tracking in high dimensions typically relies on good initial particle estimates. In an attempt to remedy this, we start with much larger joint variance along the diagonals of the covariance matrix in (8) and slowly reduce this over a burn-in period, to allow for an initial particle convergence phase. This can be considered a form of simulated annealing, which has been used previously for pose tracking by Deutscher et al.
The particle filter is a useful approximation when dealing with complex probability distributions, which cannot be analytically integrated. However, the use of a Gaussian mixture model in the transition density and a conjugate Gaussian observation model allows us to Rao-Blackwellise the particle filter by performing integrations optimally using a number of Kalman filters to track mixture components, in a manner similar to that described by Alspach and Sorenson. This approach, termed the mixture Kalman filter, has been applied to a number of conditionally linear dynamic models by Chen and Liu [2000] and Doucet et al. [2000].
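Our goal is to calculate the posterior distribution, $p(\mathbf{x}_t \mid \mathbf{z}_{1:t})$, given a sequence of measurements. Recall that a prior model on human pose, learned from Kinect training data, can be denoted by a weighted summation of Gaussians, with means and variances $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ respectively,

$$p(\mathbf{x}_t) = \sum_{k=1}^{M} \pi_k\, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).$$

This distribution can be partitioned if we introduce an indicator variable $\gamma_t$, where $\gamma_t = k$ refers to the $k$-th mixture component in the distribution. Then the prior probability over states can be denoted as

$$p(\mathbf{x}_t) = \sum_{k=1}^{M} p(\mathbf{x}_t \mid \gamma_t = k)\, p(\gamma_t = k), \qquad p(\gamma_t = k) = \pi_k, \quad p(\mathbf{x}_t \mid \gamma_t = k) = \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k).$$

Applying the random walk transition density selected in (8), with covariance $\boldsymbol{\Sigma}_w$, to each mixture component independently provides the transition density for the body pose conditioned on the indicator variable and previous state,

$$p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \gamma_t = k) = \frac{\mathcal{N}(\mathbf{x}_t; \mathbf{x}_{t-1}, \boldsymbol{\Sigma}_w)\, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\int \mathcal{N}(\mathbf{x}_t; \mathbf{x}_{t-1}, \boldsymbol{\Sigma}_w)\, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\, \mathrm{d}\mathbf{x}_t},$$

which can be solved analytically to provide the normal distribution

$$p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \gamma_t = k) = \mathcal{N}\!\left(\mathbf{x}_t;\ \mathbf{C}_k \left(\boldsymbol{\Sigma}_w^{-1} \mathbf{x}_{t-1} + \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k\right),\ \mathbf{C}_k\right), \qquad \mathbf{C}_k = \left(\boldsymbol{\Sigma}_w^{-1} + \boldsymbol{\Sigma}_k^{-1}\right)^{-1}. \tag{32}$$

Assuming only a subset of states, $\mathbf{H}\mathbf{x}_t$, can be observed in the presence of zero-mean Gaussian measurement noise with covariance $\boldsymbol{\Sigma}_v$ provides a measurement model,

$$p(\mathbf{z}_t \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{z}_t; \mathbf{H}\mathbf{x}_t, \boldsymbol{\Sigma}_v). \tag{33}$$

Equations (32) and (33) are of the form required for optimal Bayesian filtering using the Kalman filter [Kalman, 1960]. The Kalman filter marginalises out historical states and provides the posterior distribution of a state for a given trajectory of indicator variables, $p(\mathbf{x}_t \mid \boldsymbol{\gamma}^n_{1:t}, \mathbf{z}_{1:t})$, conditioned on a mixture component. Here, the boldface $\boldsymbol{\gamma}^n_{1:t}$, with $n = 1, \ldots, N$, is used to denote the $n$-th trajectory of mixture components, from time steps $1$ to $t$. First, $\bar{\mathbf{m}}^n_t$, a prediction of the state mean conditioned on a particular sequence of indicator variables up to time $t$, is made using the transition model of (32),

$$\bar{\mathbf{m}}^n_t = \mathbf{A}_k \mathbf{m}^n_{t-1} + \mathbf{b}_k,$$

assuming no process noise, with

$$\mathbf{A}_k = \mathbf{C}_k \boldsymbol{\Sigma}_w^{-1}, \qquad \mathbf{b}_k = \mathbf{C}_k \boldsymbol{\Sigma}_k^{-1} \boldsymbol{\mu}_k, \qquad k = \gamma^n_t.$$

The existing uncertainty in the mixture component is propagated through the linear process model, and uncertainty in the model included, to provide the predicted mixture covariance,

$$\bar{\mathbf{P}}^n_t = \mathbf{A}_k \mathbf{P}^n_{t-1} \mathbf{A}_k^T + \mathbf{C}_k.$$

When observations are made, the measurement and covariance residuals are calculated using

$$\mathbf{y}^n_t = \mathbf{z}_t - \mathbf{H} \bar{\mathbf{m}}^n_t, \qquad \mathbf{S}^n_t = \mathbf{H} \bar{\mathbf{P}}^n_t \mathbf{H}^T + \boldsymbol{\Sigma}_v.$$

These residuals are then used to provide the updated mean and covariance estimates

$$\mathbf{m}^n_t = \bar{\mathbf{m}}^n_t + \mathbf{G}^n_t \mathbf{y}^n_t, \qquad \mathbf{P}^n_t = \left(\mathbf{I} - \mathbf{G}^n_t \mathbf{H}\right) \bar{\mathbf{P}}^n_t,$$

where $\mathbf{G}^n_t = \bar{\mathbf{P}}^n_t \mathbf{H}^T \left(\mathbf{S}^n_t\right)^{-1}$ is the optimal Kalman gain for a linear system. Finally, the posterior density for the state conditioned on a trajectory of mixture components can then be described by a Gaussian,

$$p(\mathbf{x}_t \mid \boldsymbol{\gamma}^n_{1:t}, \mathbf{z}_{1:t}) = \mathcal{N}(\mathbf{x}_t; \mathbf{m}^n_t, \mathbf{P}^n_t).$$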
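The prediction and update for a single mixture component can be sketched as one function. The sketch assumes that the per-component process model is the normalised product of the random walk and the mixture component, giving a linear-Gaussian model with matrix `A`, offset `b` and process covariance `C`; these names, and the returned measurement log-likelihood, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def mkf_component_step(m_prev, P_prev, z, mu_k, Sigma_k, Sigma_w, H, Sigma_v):
    """One Kalman predict/update for a state conditioned on mixture component k."""
    # Linear-Gaussian process model implied by the product of Gaussians (assumed form)
    C = np.linalg.inv(np.linalg.inv(Sigma_w) + np.linalg.inv(Sigma_k))
    A = C @ np.linalg.inv(Sigma_w)
    b = C @ np.linalg.solve(Sigma_k, mu_k)

    # Prediction: mean drifts towards mu_k, covariance gains the process term C
    m_bar = A @ m_prev + b
    P_bar = A @ P_prev @ A.T + C

    # Update using only the measured subset of states selected by H
    y = z - H @ m_bar                       # measurement residual
    S = H @ P_bar @ H.T + Sigma_v           # residual covariance
    G = P_bar @ H.T @ np.linalg.inv(S)      # Kalman gain
    m = m_bar + G @ y
    P = (np.eye(len(m)) - G @ H) @ P_bar

    # Log-likelihood of the measurement, useful for weighting components
    _, logdet = np.linalg.slogdet(S)
    loglik = -0.5 * (len(z) * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(S, y))
    return m, P, loglik

# Toy example: two states, only the first is measured
m, P, loglik = mkf_component_step(
    m_prev=np.array([2.0, 2.0]), P_prev=np.eye(2), z=np.array([2.0]),
    mu_k=np.zeros(2), Sigma_k=np.eye(2), Sigma_w=np.eye(2),
    H=np.array([[1.0, 0.0]]), Sigma_v=np.array([[0.25]]))
```

In the toy example the unmeasured second state is still pulled towards the component mean by the prediction step, which is how joints that are never detected directly can be estimated.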
Using this information, the probability of an indicator variable trajectory conditioned on the sequence of measurements, , can be used to obtain the target distribution
Here, denotes the number of indicator components in the motion model, and the number of indicator variable trajectories.
The conditional indicator probability is obtained by marginalising the joint state indicator distribution,
The contents of the integral in (44) are known, with the normal measurement model of (33) and the result of the Kalman filter prediction step, also Gaussian, which we shall denote as . As a result, (44) reduces to an iterative form
with a normalising constant.
Unfortunately the sums in (43) are hard to compute, as the number of trajectories grows exponentially with each filtering iteration, so in practice we approximate (43) as a weighted sum of trajectories of interest,
The mixture Kalman filter uses importance sampling to select the subset of trajectories, with weights updated using
when indicator variables are sampled from the proposal density, .
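Using the sampled indicator variables and these weights, a maximum a posteriori estimate for the upper body pose can be obtained through a weighted combination of updated mixture means,

$$\hat{\mathbf{x}}_t = \sum_{n=1}^{N} w^n_t\, \mathbf{m}^n_t,$$

where $w^n_t$ denotes the normalised weight of the $n$-th sampled trajectory and $\mathbf{m}^n_t$ its updated Kalman filter mean.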
This pose estimate is easily calculated, typically requiring only a small number of parallel Kalman filters, so is far more efficient than a bootstrap particle filter approximation. Finally, a 3D human body pose is obtained by evaluating (5) at the estimated state.
In practice, many of the trajectory weights can become negligible after a few iterations, with only a few Gaussians contributing to the final pose estimate. This is remedied by resampling with replacement whenever the effective number of particles falls too low.
Importance sampling can be expensive, so a suboptimal approximation to (43) could be obtained by selecting a fixed set of trajectories by some other means. A number of mixture reduction schemes [Salmond; Blom] have been proposed previously, but many of these can be expensive. For example, trajectories could be selected by performing the update step for each possible mixture component and input trajectory, then discarding trajectories with low indicator weights. This approach is termed the split-track filter [Smith and Winter]. We propose that a subset of trajectories be selected by only retaining trajectories where and , which forces continuity between indicator variables and guarantees that every mixture component is fairly represented in the posterior distribution, in effect giving more weight to the prior distribution on human poses. Here, weights are updated using
As mentioned previously, weights can tend to zero for a given mixture component. Resampling in this case is not ideal, as it could become impossible for this mixture to contribute towards the pose estimate regardless of future measurements. This is undesirable as it effectively removes the mean-reverting properties of the process model. This is remedied by adding a small uniform prior, $\epsilon$, to the weights on each iteration. The size of $\epsilon$ controls the speed at which the process model is able to transition between reverting to the different mixture means in the pose prior.
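The effect of the uniform prior can be illustrated with a small sketch that adds a constant to every trajectory weight, renormalises, and forms the pose estimate as a weighted combination of component means. The function name, the value of the constant and the toy means are assumptions made purely for illustration.

```python
import numpy as np

def regularised_estimate(weights, means, eps=1e-3):
    """Add a small uniform prior to the weights, renormalise and combine means."""
    w = np.asarray(weights, dtype=float) + eps   # no component can vanish entirely
    w = w / w.sum()
    estimate = w @ np.asarray(means, dtype=float)
    return w, estimate

# One component has collapsed to zero weight; eps lets it contribute again
w, x_hat = regularised_estimate([1.0, 0.0], [[0.0, 0.0], [12.0, 12.0]], eps=0.1)
```

A larger constant lets a neglected component regain influence within fewer iterations, at the cost of biasing the estimate towards the unweighted average of the mixture means.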
In the previous section, we introduced a motion model suitable for upper body tracking using recursive Bayesian estimation and discussed a selection of tracking schemes to perform this. The first, a bootstrap particle filter, makes proposals from the GMM transition model in (14) and uses the weight update equation in (23). This sampling step is quite time consuming, and the second, faster scheme discussed draws samples from the simple random walk in (8) for use with the weight update equation in (26). The third tracker neglects the scaling evidence term in the weight update equation of (26) to obtain an even faster approximation. Neglecting this term is equivalent to assuming independence across mixture components, or that the transition noise is added to each mixture component separately. The final two tracking schemes introduced also use this slightly modified transition model, where noise is added to each mixture component independently, to allow for an iterative solution using the mixture Kalman filter. The importance sampling step used to select indicator variables in this scheme can be time consuming, so an approximation using a deterministic set of indicator trajectories was also proposed, where each indicator variable selected is paired with a specific trajectory.
Results obtained after applying the five tracking schemes discussed to manually annotated image sequences are provided here. Each of the schemes was applied to image sequences with a moving person, and the pose estimates compared to those obtained using the Kinect motion tracker. Independent datasets were used to learn the pose priors and test the pose estimates. Figure 3 shows the mean pixel error for each joint over the test sequence.
No simulated annealing was used for the scheme sampling from the full Gaussian mixture model, as this required a larger level of noise in the transition model in order to avoid losing track of the joints completely. The figure shows that the best performance was obtained using the mixture Kalman filter (MKF) approaches. Of the particle filter approaches, the sampling scheme with no scaling converged and tracked the actual pose best, with rather poor tracking achieved when weighting was included. The theoretically preferred Gaussian mixture model sampling was unable to adequately track motion, presumably due to its slow convergence.
A commonly used metric for assessing the performance of 2D pose estimation algorithms is the percentage of correct parts (PCP) [Yang and Ramanan, 2011], which reports the proportion of body parts that are correctly localised, where a body part is deemed to be correctly localised if its end points fall within some fraction of the ground truth body part length. Figure 4 shows the PCP curves for each of the various tracking schemes (only forearm and upper arm localisation is considered). This metric highlights the performance of the mixture Kalman filters.
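The metric described above can be computed in a few lines. The sketch below uses the strict variant in which both end points of a part must fall within the given fraction of the ground truth part length; the array layout `(parts, end points, coordinates)` and the default fraction of 0.5 are assumptions for illustration.

```python
import numpy as np

def pcp(pred, truth, fraction=0.5):
    """Proportion of parts whose two end points both lie within
    fraction * (ground truth part length) of the true end points."""
    pred = np.asarray(pred, dtype=float)      # shape (n_parts, 2, 2)
    truth = np.asarray(truth, dtype=float)
    lengths = np.linalg.norm(truth[:, 0] - truth[:, 1], axis=1)
    errors = np.linalg.norm(pred - truth, axis=2)            # (n_parts, 2)
    correct = np.all(errors <= fraction * lengths[:, None], axis=1)
    return float(correct.mean())

# Two parts: the first is localised well, the second has one end point too far off
truth = [[[0, 0], [10, 0]], [[0, 0], [0, 4]]]
pred = [[[1, 0], [10, 2]], [[0, 3], [0, 4]]]
score = pcp(pred, truth)
```

Sweeping `fraction` over a range of values and plotting the resulting scores gives a PCP curve of the kind shown in Figure 4.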
Figure 5 indicates the pixel errors obtained for each joint over the entire test period. Noticeable error spikes that occur when the particle filters are used are not present in the mixture Kalman filter results. The superior performance of the mixture Kalman filter approaches and the simple sampling scheme disregarding scaling make it clear that the modified transition density of (32), where noise is added to each component independently, is a better model of human motion than that of (14).
Table 1 shows the average time taken for each filter iteration, when each of the suggested tracking schemes is used. It is clear that sampling from the full GMM is significantly more time consuming than the simple sampling, but that the mixture Kalman filters are far faster than all of the particle filter approximations.
|Tracking scheme|Average time per iteration|
|---|---|
|Simple sampling, with scaling (10000 particles)|0.046 s|
|Simple sampling, no scaling (10000 particles)|0.028 s|
|GMM sampling (10000 particles)|2.947 s|
|MKF (30 mixture components)|0.021 s|
|MKF, fixed tracks (30 mixture trajectories)|0.015 s|
Note that the mixture Kalman filter approximation using deterministically selected tracks provides almost identical performance to the MKF using sampled indicator variables, but is significantly faster. Qualitative results show that using the MKF with fixed tracks provides a much smoother tracking result (see accompanying videos). This deterministic MKF also appears to be better at dealing with uncommon scenarios such as raised arms (Figure 6), which is presumably due to the fact that all mixture components are paired with a specific trajectory, and as a result can always contribute to a pose estimate. In contrast, the original MKF will place emphasis on mixture components that carry more weight, and this effect will propagate until components of less weight become negligible. The MKF distribution obtained when sampling indicators may be closer to the true joint distribution, but appears less suited to providing a point estimate as a result, since it appears to be more susceptible to ambiguities in the pose estimation.
Thus far, manually annotated images have been used to compare pose estimation schemes. The process of detecting head and hand positions and incorporating these into the filtering framework is now described.
Automatic face detection is frequently required by computer vision systems and a large number of extremely effective algorithms are available to accomplish this. In this work, an OpenCV [Bradski, 2000] implementation of the well-known Viola and Jones face detector is applied. This detector classifies faces using a cascade of boosted classifiers, trained using the responses to Haar-like features.
The face detector is trained over a wide selection of faces, but only frontal faces are used as positive training examples, in line with our end application of human-robot interaction. In these applications, a robot should only attempt to engage with a person who is looking directly at it, in the same way humans make eye contact when conversing.
The face detection is augmented through the addition of face tracking using a Kalman filter [Kalman, 1960] and constant velocity motion model, applying a modified version of the simple object tracker described by Burke. This tracking provides a degree of robustness to false negatives (faces present, but not detected), and can be used to reject false positives (faces detected, but not present) as these tend to be detected sporadically and fail to provide lasting tracks.
With each input image, detected faces are compared to tracked faces using a Euclidean norm distance measure, including the position and size (height and width) of the faces. If this measure falls below a certain threshold, the update stage of the Kalman filter is applied to the corresponding tracked face. If this is not the case, a new track is started. When faces have not been observed for a certain number of time steps, they are removed from the list of tracked faces. Similarly, tracked faces are only used if the track has lasted for a predefined length of time.
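The association logic described above can be sketched as follows. This is a minimal illustration rather than the tracker used in our experiments: the names `Track`, `update_tracks` and `confirmed`, the nearest-neighbour matching rule, and the values of `gate`, `max_missed` and `min_age` are all assumptions, and the Kalman update is replaced by simply adopting the new measurement.

```python
import math
from dataclasses import dataclass


@dataclass
class Track:
    box: tuple       # (x, y, width, height) of the tracked face
    age: int = 0     # number of frames the track has existed
    missed: int = 0  # consecutive frames without a matching detection


def box_distance(a, b):
    """Euclidean norm over face position and size."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def update_tracks(tracks, detections, gate=40.0, max_missed=10):
    """Associate detections with tracks, start new tracks, retire stale ones."""
    matched = set()
    for det in detections:
        # Closest track that has not already been matched in this frame.
        candidates = [(box_distance(t.box, det), i)
                      for i, t in enumerate(tracks) if i not in matched]
        best = min(candidates, default=None)
        if best is not None and best[0] < gate:
            # Stand-in for the Kalman update stage: adopt the measurement.
            tracks[best[1]].box = det
            tracks[best[1]].missed = 0
            matched.add(best[1])
        else:
            # No track is close enough, so a new track is started.
            tracks.append(Track(box=det))
            matched.add(len(tracks) - 1)
    for i, t in enumerate(tracks):
        if i not in matched:
            t.missed += 1
        t.age += 1
    # Faces not observed for too long are removed from the list.
    return [t for t in tracks if t.missed <= max_missed]


def confirmed(tracks, min_age=5):
    """Only tracks that have lasted long enough are used downstream."""
    return [t for t in tracks if t.age >= min_age]
```

Calling `update_tracks` once per frame and passing the result to `confirmed` mirrors the two thresholds in the text: one on how long a face may go unobserved, and one on how long a track must last before it is trusted.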
Once detected, faces contain important information, which can assist in the detection of other body parts. This section shows how the detected face can be used to determine the tracked person’s skin colour, and segment hands.
First, a histogram of the colours (Lab colour space) present in a square image patch bounding the detected face is back-projected to provide a likelihood map of image areas resembling skin. Here, back-projection refers to the process of evaluating the probability of an image pixel being skin coloured, with the likelihood approximated by a histogram of the detected pixel values in a training image patch. An exponentially weighted moving average filter favouring historical measurements is applied to the histogram to limit the effects of spurious lighting dependent observations.
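A minimal sketch of histogram back-projection with exponential smoothing is given below. It assumes 8-bit Lab pixels stored as `(L, a, b)` tuples and, as a simplification that may differ from our implementation, builds the histogram over the two chroma channels only. The function names, the bin count and the smoothing factor `alpha` are illustrative choices.

```python
def chroma_histogram(pixels, bins=16, max_value=256):
    """Normalised 2D histogram over the a and b channels of (L, a, b) pixels."""
    step = max_value / bins
    hist = [[0.0] * bins for _ in range(bins)]
    for _, a, b in pixels:
        hist[int(a // step)][int(b // step)] += 1.0
    total = max(len(pixels), 1)
    return [[v / total for v in row] for row in hist]


def smooth_histogram(previous, current, alpha=0.1):
    """Exponentially weighted moving average; a small alpha favours history."""
    return [[(1.0 - alpha) * p + alpha * c for p, c in zip(prev_row, cur_row)]
            for prev_row, cur_row in zip(previous, current)]


def back_project(image, hist, bins=16, max_value=256):
    """Replace every pixel with the histogram value of its colour bin."""
    step = max_value / bins
    return [[hist[int(a // step)][int(b // step)] for _, a, b in row]
            for row in image]
```

On each frame, the histogram of the current face patch would be blended into the running histogram with `smooth_histogram` before `back_project` is applied to the full image, so that a single badly lit frame changes the skin model only slightly.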
Originally, a Gaussian mixture model was trained using this image patch and used for skin colour segmentation, but this proved computationally expensive, and provided little improvement over a simple back-projection. In order to assist in the recognition of hands, areas of high likelihood are only labelled as left or right hands when placed within an initialisation area, consisting of the left and right halves of the input image. This serves as the hand detection process. The hand likelihood image can contain unwanted static artefacts, due to skin coloured objects or shadows in the image. We can remove these artefacts by applying a background segmentation algorithm [Zivkovic, 2004], which classifies pixels as foreground or background objects using an adaptive per pixel Gaussian mixture model. This segmentation process labels static objects as background by maintaining a history of pixel values over frames, so assumes a static camera. As a result, this assumption may not be ideal for applications in mobile robotics. Fortunately, the background segmentation is not essential and can be removed for mobile applications, with only a slight degradation in qualitative hand detection results.
Immediately after initialisation, a mean-shift tracker [Bradski, 1998] is used to track the detected hands. Mean-shift locates the maxima of a likelihood function, in this case the back-projection likelihood obtained using the detected face, using discrete samples from the distribution. On each iteration, the original hand position is adjusted based on the mean-shift maxima. As hands typically form larger blobs than wrists, the mean-shift tracker tends to remain centred on hands, and does not typically move along skin coloured forearms. When combined with the initialisation process, this allows for relatively robust hand tracking. If hands are lost (the average likelihood in the tracked hand area drops below a predefined threshold), the user simply re-initialises the hand tracker by returning their hands to the original initialisation area.
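The mean-shift iteration can be illustrated with the following sketch, which repeatedly moves a square window to the likelihood-weighted centroid of the pixels it covers. The function name, window shape, stopping tolerance and the use of zero window mass as the "lost" condition are assumptions made for illustration; the text above uses an average likelihood threshold instead.

```python
import math


def mean_shift(likelihood, centre, half_size, max_iter=20, tol=0.5):
    """Shift a square window to the weighted centroid until it settles.

    likelihood is a 2D list indexed [row][col]; centre is (row, col).
    Returns the converged (row, col), or None if the window holds no mass.
    """
    rows, cols = len(likelihood), len(likelihood[0])
    r, c = float(centre[0]), float(centre[1])
    for _ in range(max_iter):
        r0 = max(0, round(r) - half_size)
        r1 = min(rows, round(r) + half_size + 1)
        c0 = max(0, round(c) - half_size)
        c1 = min(cols, round(c) + half_size + 1)
        mass = sum_r = sum_c = 0.0
        for i in range(r0, r1):
            for j in range(c0, c1):
                w = likelihood[i][j]
                mass += w
                sum_r += w * i
                sum_c += w * j
        if mass == 0.0:
            return None  # nothing skin coloured in the window: track lost
        new_r, new_c = sum_r / mass, sum_c / mass
        shift = math.hypot(new_r - r, new_c - c)
        r, c = new_r, new_c
        if shift < tol:
            break
    return (r, c)
```

Because the window is pulled towards the centroid of the likelihood mass it contains, a compact hand blob attracts it more strongly than a thin forearm region, which is consistent with the behaviour described above.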
The mean-shift tracker is unable to track rapidly moving objects particularly well, so is augmented through the use of a constant velocity Kalman filter tracker similar to that used for face tracking, which provides a predicted region of interest in which to search for a hand and improves the mean-shift tracking. Note that the predicted region could have been obtained by using the predicted body position in the mixture Kalman filter framework, but it turns out that the random walk motion model is not a very good predictor of hand positions, since it contains no velocity information. Figure 7 illustrates the detection and tracking process.
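A constant velocity Kalman filter of the kind used to predict the search region can be sketched per image axis as below. This is a minimal illustration assuming a white-noise acceleration process model and a direct position measurement; the class name and the noise variances are hypothetical and would need tuning for real image data.

```python
class ConstantVelocityKalman1D:
    """Kalman filter over one image axis with state (position, velocity)."""

    def __init__(self, position, dt=1.0, process_var=1.0, meas_var=4.0):
        self.x = [float(position), 0.0]
        # Large initial velocity variance: the velocity is unknown at start.
        self.P = [[meas_var, 0.0], [0.0, 100.0]]
        self.dt, self.q, self.r = dt, process_var, meas_var

    def predict(self):
        """Propagate the state with F = [[1, dt], [0, 1]]; return position."""
        dt, q = self.dt, self.q
        p, v = self.x
        self.x = [p + dt * v, v]
        (a, b), (c, d) = self.P
        # P <- F P F^T + Q, with Q from a white-noise acceleration model.
        self.P = [
            [a + dt * (b + c) + dt * dt * d + q * dt ** 4 / 4.0,
             b + dt * d + q * dt ** 3 / 2.0],
            [c + dt * d + q * dt ** 3 / 2.0,
             d + q * dt * dt],
        ]
        return self.x[0]

    def update(self, z):
        """Correct the state with a position measurement z (H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r                 # innovation variance
        k0, k1 = a / s, c / s          # Kalman gain
        y = z - self.x[0]              # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        # P <- (I - K H) P
        self.P = [[(1.0 - k0) * a, (1.0 - k0) * b],
                  [c - k1 * a, d - k1 * b]]
```

One filter per axis gives a predicted hand centre on each frame, around which the mean-shift search window can be placed. Unlike the random walk model, the velocity state lets the prediction follow a hand that moves steadily across the image.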
Unfortunately, the use of skin colour to detect hands leads to difficulties in discriminating between hands. This is alleviated somewhat by masking the image area predicted to contain the left hand when tracking the right hand, and vice versa with the left, but problems still occur when hands merge, or for clapping motions, where a constant velocity prediction causes hands to swap. Examples of these failures are shown in Figure 8. Micilotta and Bowden have proposed the use of a GMM trained using prior pose estimates to disambiguate left and right hands, but this simply tends to identify hands as left if they are found to the left side of the head (and vice versa to the right), sometimes incorrectly rejecting instances where hands cross the body.
Errors resulting from incorrect hand association could be avoided by taking the orientation of the arms into account in the hand detection process. Unfortunately, it is quite difficult to detect arms, which can have highly variable appearances in images. However, once hands have been detected, we can assess the validity of a pose estimate using additional image features and use this to correct hand association errors. The MKF pose estimate contains the 2D position of each joint, and can be used to form a stick model similar to that drawn in Figure 8, with limbs described by a set of oriented edges. As a result, we propose that a natural measure of a pose estimate’s likelihood is one that uses orientation information from edges detected in the image.
Initially, an edge-based image representation is obtained using the Canny edge detector [Canny, 1986]. The probabilistic Hough line detector [Matas et al., 2000] is then used to detect linear edge segments. The number of edge segments providing support for a pose estimate or limb position is then used to decide if the correct hand association has been made, or if the pose estimate is in error. We use a Gaussian kernel to determine edge support, with edges considered as evidence for a given limb if the likelihood
$$\exp\left(-\tfrac{1}{2}(\mathbf{e}-\mathbf{l})^{\mathsf{T}}\boldsymbol{\Sigma}_e^{-1}(\mathbf{e}-\mathbf{l})\right)$$
is greater than some threshold $\tau$. Here, $\mathbf{e}$ is a vector of the edge orientation and the image position of a detected edge midpoint, while $\mathbf{l}$ contains the position and orientation of the estimated limb. $\boldsymbol{\Sigma}_e$ is a diagonal covariance matrix, with variances selected empirically to allow feasible position and angle offsets. Figure 9 illustrates the voting process for a given pose estimate.
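The voting process can be sketched as follows. The sketch assumes that each detected edge is summarised by its midpoint and orientation `(x, y, theta)`, that a limb is summarised by the midpoint and orientation of the segment joining two joints, and that orientations are compared modulo pi since a line has no direction. The function names, standard deviations and threshold are hypothetical values chosen for illustration.

```python
import math


def limb_from_joints(p, q):
    """Midpoint and orientation (x, y, theta) of the limb joining p and q."""
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0,
            math.atan2(q[1] - p[1], q[0] - p[0]))


def edge_support(edges, limb, sigmas=(10.0, 10.0, 0.3), threshold=0.5):
    """Count edges whose Gaussian kernel response to the limb exceeds threshold."""
    sx, sy, st = sigmas  # standard deviations of the diagonal covariance
    votes = 0
    for ex, ey, etheta in edges:
        dx, dy = ex - limb[0], ey - limb[1]
        # Wrap the orientation difference into [-pi/2, pi/2).
        dtheta = (etheta - limb[2] + math.pi / 2.0) % math.pi - math.pi / 2.0
        mahalanobis = (dx / sx) ** 2 + (dy / sy) ** 2 + (dtheta / st) ** 2
        if math.exp(-0.5 * mahalanobis) > threshold:
            votes += 1
    return votes
```

Comparing the vote totals for the current pose estimate and for the estimate obtained with the hand measurements swapped gives a simple way of deciding which hand association is better supported by the image.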
The proposed heuristic allows for data association errors in hand measurement to be corrected relatively quickly, but does not prevent these errors from occurring in the first place. Direct measurement of limb positions should eliminate hand association errors of this type completely.
Results obtained when the head and hand detectors of Section 5 are used in conjunction with the mixture Kalman filter (deterministically selected tracks) are provided here. Figure 10 shows the mean error for each joint over a test sequence of more than 1000 images, when estimated 3D positions were compared with the skeleton output of a Kinect sensor, by aligning the head, neck and shoulders using fixed scale Procrustes analysis [Schönemann, 1966]. This comparison is not ideal, as the Kinect is not perfectly accurate and often fails when hands cross over the body, but it does provide an indication that the 3D pose estimate is plausible.
Figure 11 shows the position errors obtained for each joint over the entire test period, when compared with the positions obtained with the Kinect sensor. For much of the time, the error in hand position remains below 20 cm, with error spikes only occurring when the hands crossed the body or moved rapidly. A particularly encouraging result is that the average elbow errors remained quite low, even though no measurements of these joints were made at all. Qualitative results can be seen in the accompanying video, which also shows the edge-based error correction in operation.
A noticeable source of error involved uncommon poses that were not present in the prior training data. This should be remedied by additional training, but potentially at the expense of pose estimation accuracy in other poses.
Figure 12 shows the 2D tracking errors obtained by applying the Eichner et al. 2D pose estimation approach (the current state of the art in 2D pose estimation in unconstrained images) and our technique to a sequence of over 500 images, using Kinect joint tracks as ground truth. This sequence is particularly challenging for our approach as the study participant is wearing short sleeves, which could potentially result in hand detection failures due to skin coloured arm regions, and increases the risk of incorrect hand association. The latter occurs towards the end of the tracking sequence, resulting in large hand and elbow tracking errors that failed to be corrected immediately by the edge-based pose correction. The figure also shows the results of the MKF pose estimation when background subtraction is not applied, which are very similar to those obtained when this is included. In fact, additional noise in the likelihood map used for hand detection prevented the hand association failure that occurred when background subtraction was applied, resulting in overall improved performance, although a number of spurious pose estimates were observed instead.
It should be noted that the Eichner et al. approach is at a disadvantage here as it does not incorporate temporal information, but it does perform far more processing, operating at approximately 0.5 frames per second. Our approach operates at just under 30 frames per second, with face detection the primary bottleneck.
In practice, the Eichner et al. pose estimation performed well at upper arm detections, but typically failed at forearm detections, presumably due to the cluttered background used for experimentation. Figure 13 shows the PCP curves comparing the 2D pose estimation accuracy. Our approach shows significant improvement in detection rates for greater detection thresholds, while providing similar performance to Eichner et al. over smaller thresholds. Once more, improved performance was seen when background subtraction was not applied, resulting from the absence of incorrect hand association in this test set, but a performance reduction is expected in cases where a number of skin-coloured objects are present in the image background. Qualitative results can be seen in the accompanying video, which provides a comparison with Kinect pose estimation and that of Eichner et al.
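For reference, a PCP curve can be computed along the following lines. The sketch assumes the commonly used definition in which a body part counts as correct when both of its estimated endpoints lie within a fraction of the ground truth part length of the corresponding ground truth endpoints; the exact variant used in a given evaluation may differ, and the function names are illustrative.

```python
import math


def pcp(estimates, ground_truth, threshold=0.5):
    """Fraction of parts with both endpoints within threshold * part length.

    Each part is a pair of 2D endpoints, for example ((x0, y0), (x1, y1)).
    """
    correct = 0
    for (e0, e1), (g0, g1) in zip(estimates, ground_truth):
        limit = threshold * math.dist(g0, g1)
        if math.dist(e0, g0) <= limit and math.dist(e1, g1) <= limit:
            correct += 1
    return correct / len(ground_truth)


def pcp_curve(estimates, ground_truth, thresholds):
    """PCP evaluated over a range of detection thresholds."""
    return [pcp(estimates, ground_truth, t) for t in thresholds]
```

Sweeping the threshold produces a curve of detection rate against tolerance, which is how the behaviour at smaller and greater thresholds can be compared between methods.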
This paper has provided results on upper body pose tracking using Kinect joint priors and simple hand and head measurements. Four tracking schemes have been considered and a mixture Kalman filter shown to provide effective upper body pose estimation. The use of the proposed upper body model allows reliable pose estimates to be obtained indirectly for a number of joints that are often difficult to detect using traditional object recognition strategies. The suggested model is designed with computational efficiency and analytical tractability in mind, yet still incorporates bio-mechanical properties of the upper body, typically only included using more complex body models.
Comparisons with the current state of the art in 2D pose estimation [Eichner et al., 2012] have shown that our approach outperforms this significantly, both in terms of estimation performance and time complexity. Good 3D tracking results were also exhibited during experimentation.
A mechanism for correcting hand data association errors has been provided, but these errors will continue to occur without the inclusion of additional joint measurements. Improved hand association is required if multiple humans are to be tracked at once. While good results have been obtained for a constrained set of camera viewpoints, additional priors and improved measurements may be required to resolve pose ambiguities if 3D position is required over a larger range of viewpoints. Future work will involve the inclusion of a better mechanism for detecting hands and evaluating the effects of including additional training data collected from multiple persons.
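Many tracking applications require a suitable probabilistic model of prior and likelihood distributions. Gaussian mixture models (GMMs) are a popular choice of model for probability distributions due to their ability to approximate a wide variety of complex distributions with a limited number of parameters. Only a brief overview of GMMs is provided here, but readers are referred to Bishop for additional information. These models are particularly useful in acquiring an analytical approximation to a probability distribution when only discrete samples from the distribution are available. Formally, a Gaussian mixture model is defined as
$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_k|^{1/2}} \exp\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_k)^{\mathsf{T}}\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}-\boldsymbol{\mu}_k)\right),$$
with parameters $\pi_k$, $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$. $d$ denotes the length of the state vector $\mathbf{x}$. $\boldsymbol{\Sigma}_k$ is symmetric and positive definite.
Training a GMM using discrete data is accomplished through expectation maximisation. Expectation maximisation is an iterative two step process obtaining the maximum likelihood estimate of the parameters in a model. Assuming $N$ observations, start with an initial, random estimate of the model parameters and calculate the responsibility $\gamma_{nk}$ that the $k$-th Gaussian takes for explaining an observation $\mathbf{x}_n$,
$$\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}.$$
This is termed the expectation step. The maximisation stage occurs by applying analytic estimators to maximise the likelihood of the data. Parameter $\boldsymbol{\mu}_k$ is calculated as
$$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, \mathbf{x}_n.$$
The effective number of points assigned to the $k$-th Gaussian in the mixture model is calculated as
$$N_k = \sum_{n=1}^{N} \gamma_{nk}.$$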