Log In Sign Up

Single camera pose estimation using Bayesian filtering and Kinect motion priors

by   Michael Burke, et al.

Traditional approaches to upper body pose estimation using monocular vision rely on complex body models and a large variety of geometric constraints. We argue that this is not ideal and somewhat inelegant as it results in large processing burdens, and instead attempt to incorporate these constraints through priors obtained directly from training data. A prior distribution covering the probability of a human pose occurring is used to incorporate likely human poses. This distribution is obtained offline, by fitting a Gaussian mixture model to a large dataset of recorded human body poses, tracked using a Kinect sensor. We combine this prior information with a random walk transition model to obtain an upper body model, suitable for use within a recursive Bayesian filtering framework. Our model can be viewed as a mixture of discrete Ornstein-Uhlenbeck processes, in that states behave as random walks, but drift towards a set of typically observed poses. This model is combined with measurements of the human head and hand positions, using recursive Bayesian estimation to incorporate temporal information. Measurements are obtained using face detection and a simple skin colour hand detector, trained using the detected face. The suggested model is designed with analytical tractability in mind and we show that the pose tracking can be Rao-Blackwellised using the mixture Kalman filter, allowing for computational efficiency while still incorporating bio-mechanical properties of the upper body. In addition, the use of the proposed upper body model allows reliable three-dimensional pose estimates to be obtained indirectly for a number of joints that are often difficult to detect using traditional object recognition strategies. Comparisons with Kinect sensor results and the state of the art in 2D pose estimation highlight the efficacy of the proposed approach.


page 5

page 14

page 17

page 18

page 19

page 20

page 21


Enhanced Mixtures of Part Model for Human Pose Estimation

Mixture of parts model has been successfully applied to 2D human pose es...

Real-time Tracking-by-Detection of Human Motion in RGB-D Camera Networks

This paper presents a novel real-time tracking system capable of improvi...

Yoga-82: A New Dataset for Fine-grained Classification of Human Poses

Human pose estimation is a well-known problem in computer vision to loca...

RePose: Learning Deep Kinematic Priors for Fast Human Pose Estimation

We propose a novel efficient and lightweight model for human pose estima...

Automatic Analysis of Human Body Representations in Western Art

The way the human body is depicted in classical and modern paintings is ...

Fatigue Detection

Nowadays, there are many fatigue detection methods and the majority of t...

Automatic Assessment of Infant Face and Upper-Body Symmetry as Early Signs of Torticollis

We apply computer vision pose estimation techniques developed expressly ...

Code Repositories


Upper body joint tracking for the RWTH-Phoenix signer database using a Mixture Kalman filter

view repo

1 Introduction

Reliable human pose estimation is a frequently encountered computer vision task, often required for successful vision-based gesture or action recognition systems. Specifically, our goal is to perform gesture recognition for human-robot interaction, which requires the 3D positions of human upper bodies to be tracked. Unfortunately, this is a particularly challenging problem, especially in cluttered environments with potentially moving cameras.

2D information typically suffices if only static gestures are to be recognised, but 3D information is required for most temporal gesture recognition solutions [Wu and Huang, 1999]. Multiple camera motion capture systems can provide 3D measurements with a high level of accuracy, but often require that users wear markers that aid in detection. Stereo camera vision allows for relatively accurate 3D spatial information to be obtained and as a result is commonly used for temporal gesture recognition. This is evidenced by the gesture recognition schemes of Triesch and Von Der Malsburg [1998], Lee [2006] and Nickel and Stiefelhagen [2007], which all use stereo vision systems to observe gestures. 3D information can also be obtained using structured light systems such as the Kinect or PrimeSense depth sensor. Unfortunately, while the Xbox Kinect skeleton tracker Shotton et al. [2011] is extremely effective, in many applications, where payloads are limited, this is infeasible, and a body tracking solution relying only on monocular vision would be preferred.

This paper aims to solve the 3D upper body pose estimation problem using images obtained by only a single camera. We propose a novel upper body model, trained using Kinect pose priors and designed with analytical tractability in mind. We show that pose tracking using this model can be Rao-Blackwellised using the mixture Kalman filter, allowing for computational efficiency while still incorporating bio-mechanical properties of the upper body. The model is used within a recursive Bayesian framework to provide reliable estimates of user head, neck, shoulder, elbow and hand locations when only a subset of body joints can be detected.

Face detection is used to determine head position, and provides a skin colour prior that assists in locating hands. Edge-based error correction is proposed to correct potential hand association errors before head and hand measurements are used to estimate upper body pose.

The paper is organised as follows. Section 2 discusses related work and provides some background to the problem. This is followed by a description of Bayesian filtering for human pose estimation, the introduction of our body models and various tracking algorithms that can be used with these in section 3. A comparison of the trackers is provided in section 4, before we describe how we obtain head and hand measurements from images in section 5. Results obtained when these measurements are used in conjunction with our model and body tracker are presented in section 6, along with a comparison with a recent 2D pose estimation approach [Eichner et al., 2012]. Finally, conclusions are provided in section 7.

2 Background and related work

Effective human pose estimation is required for successful vision-based gesture recognition systems to be deployed. This section describes various approaches to human pose estimation, within the context of gesture recognition.

A vast amount of work has been conducted in the field of human pose estimation using monocular vision. Two approaches to pose estimation from static images have emerged, the first relying on tracking and generative models, and the second on morphological recognition. Morphological recognition techniques can be top-down, where entire bodies are recognised, or bottom-up, where bodies are recognised by locating various body parts or components. Gavrila and Davis [1996] use a top-down, search-based technique to locate poses by matching contours or edges formed using a generative body model with those in an input image.

A number of top-down approaches rely on matching extracted silhouettes to a known database. This technique is applied by Germann et al. [2011] who refine matched pose estimates using a set of 3D body part constraints. This approach relies on multiple cameras though, and the extraction of silhouettes, which can be challenging. Further, the authors note that additional information is required to estimate poses where the arms are close to the body, as silhouettes do not contain sufficient information to do so.

The dominant approach to pose estimation is bottom-up [Yang and Ramanan, 2011], using a pictorial structure of body parts with geometric constraints modelling component interactions. Yang and Ramanan [2011] use a family of affinely warped templates and a mixture model capturing contextual relations and produce good pose estimation results at approximately 1 frame a second. A pictorial structure model is also used by Eichner et al. [2012], who detect bodies using a part-based model, segment these bodies using Grabcut [Rother et al., 2004] and then fit appearance models trained previously using labelled data. This approach also provides good performance, but can be slow, and only works on near frontal and rear viewpoints.

A number of pose estimation techniques use segmentation to locate and extract human bodies. In their work on pose estimation for sign-language videos, Charles et al. [2013]

leverage the layering of signers on video to extract bodies using co-segmentation, before estimating joint locations using a selection of random forests trained on a number of previously segmented bodies (labelled using the work of

Buehler et al. [2009]). Unfortunately, accurate segmentation is slow on general video sequences and not usually feasible for real time applications.

Bottom-up approaches to pose estimation are also used in tracking-based pose estimation approaches. Lee and Cohen [2004]

used a 21 degree-of-freedom generative model of human kinematics, shape and clothing in a data-driven Markov chain Monte Carlo search. Here, visual cues of the face, head and shoulder contours, skin blobs and arm ridges were used to aid importance sampling and drive a Monte Carlo search to feasible 3D pose candidates. Unfortunately, estimating 3D pose estimation from static 2D images results in a number of pose ambiguities as different body configurations can appear similar when viewed from different points.

Many pose estimation techniques rely on Monte Carlo simulation or particle filtering. Particle filters represent the posterior belief in a state, conditioned on a set of measurements, by a set of random state samples drawn from this distribution. While the particle filter is able to approximate non-Gaussian noise distributions extremely well, it is computationally intensive as motion and observation models need to operate on multiple particles. Moreover, the memory requirements of particle filter algorithms are excessive, as the performance of the algorithm is dependent on the number of particles used.

In high dimensional state spaces, the effective number of particles required to approximate the posterior belief can become extremely large and the particle filter tends to operate as a traditional optimisation problem when a feasible number of particles is used. In these cases, additional information is often required to constrain the search space and produce good particle estimates.

Sminchisescu and Triggs [2001] note that many particle filtering algorithms for 3D pose estimation often require the addition of extra noise to assist in the search for minima. They attempt to resolve this by using a complex body model and through careful design of the observation likelihood function, incorporating priors on the anthropometric data of internal proportions, parameter stabilisers, joint limits, and body part penetration avoidance. They also apply covariance scaled sampling to direct the search, which involves combining assumed dynamics with the posterior distribution and growing the prior covariances to sample more broadly. This search can be sped up through the addition of kinematic reasoning to assist in the sampling, reducing the number of possible solutions to a pose if the lengths of limbs are known [Sminchisescu and Triggs, 2003].

Jauregui et al. [2010] also apply kinematic reasoning to aid in pose estimation, but use a silhouette-based observation model. Here, silhouettes are extracted using background subtraction, faces detected and a skin colour model learned. A clothing colour model is also learned, using an image patch directly below the face. These colours are then used when projecting a generative 3D body model, which is compared to the thresholded body.

Deutscher et al. [2000] have proposed the use of simulated annealing to solve the high dimensional search problem associated with 3D pose estimation. Here, a set of weighting functions are used to drive the particle filter search to possible solutions. Davison et al. [2001] perform 3D tracking using multiple cameras and a simulated annealing search. In this case, generative body models are used to create edge and foreground templates, which are compared to those observed using a sum of squared distances metric.

The difficulties in 3D pose estimation from 2D images have led some researchers to focus on 2D pose estimation in images, a slightly better posed problem. Hua et al. [2005] apply Markov chain Monte Carlo estimation to fit a set of 2D quadrangles to humans in images, using an observation model combining colour measurements of the head and hands (learned after face detection), and line segments extracted from the torso.

Applying Monte Carlo search techniques to pose estimation has the benefit of allowing a number of constraints and priors to be incorporated. However, the large number of constraints and complex models required to direct the high dimensional search is hardly ideal, and somewhat inelegant, resulting in large processing burdens. The incorporation of these constraints through priors obtained directly from training data is proposed here, in an attempt to simplify the sampling stages.

The process of learning constraints from training data has been advocated by Yu et al. [2013], who clustered 3D body positions according to various action categories, then used action recognition and 2D body parts detected using a deformable part model to predict 3D pose with a random forest. The use of action recognition restricts the possible pose search space, allowing for faster and more accurate pose estimation.

Howe et al. [1999] used 3D motion capture data to train a Gaussian mixture model prior, which when combined with a Gaussian error model of 2D tracked body parts allows a 3D pose estimate to be computed using Expectation maximisation. Our approach is similar to this as it also uses a Gaussian mixture prior to incorporate body constraints, but differs through the inclusion of temporal motion tracking using recursive Bayesian estimation. In addition, only a subset of body parts need to be detected for body tracking. A description of our pose estimation method follows.

3 Bayesian filtering for human pose estimation

Assuming the human body can be modelled as an unobserved Markov process with a set of joint states at time , recursive Bayesian estimation allows states to be updated as measurements are made.


Here, is a normalising constant and the nomenclature refers to the collection of states from time step 1 to t. This process allows for continual state estimation that includes temporal information, using a transition model to predict state changes and an observation model to introduce measurement information.

For human body tracking, the state vector

could comprise the 3D positions of all joints of interest, camera position and orientation, but this causes a number of estimation difficulties when only 2D image measurements obtained from a single camera measurements are available. In this case, image measurements are a non-linear function of the camera position and orientation, which complicates the tracking problem significantly.

This complication can be avoided by performing all filtering in the image plane and only returning to 3D coordinates when a state estimate is obtained. Let and be image coordinates of a body joint, , observed by a camera with 6 degree-of-freedom pose ,


Here, denotes an intrinsic camera calibration matrix,


with and focal distances and coordinates of the camera’s principal point.

Selecting a state vector comprising the scale parameter , image plane coordinates , and camera pose allows us to make direct comparisons between state and measurements. Once a state estimate is made, returning to 3D coordinates is trivial, with


and denoting the -th column vector of the projection matrix in (3).

3.1 Transition model

We construct a transition model by combining a simple motion model with an objective function or prior:


This decomposition is useful as it allows a prior distribution covering the probability of a human pose occurring to be used to incorporate likely human poses into the motion model. This distribution is obtained offline, by fitting a Gaussian mixture model (GMM) to a large dataset of recorded human body poses. The positions of upper body joints of interest are tracked using a Kinect sensor [Shotton et al., 2011]. Recorded 3D joint positions are then projected into 2D, assuming a pinhole camera with a known camera calibration matrix, , and a random set of camera viewpoints within a set of constraints (, and ; and translation m). This provides a much larger set of recorded 2D joint positions. Figure 1 shows the original 3D recorded pose data, and the corresponding 2D pose data generated through the synthetic viewpoints is shown in Figure 2.

Figure 1:

3D upper body joint distributions are captured by recording a human upper body as it undergoes common motions. (Head - red, shoulders - green, elbows - yellow, hands - blue)


(a) Head distribution


(b) Shoulder distributions


(c) Elbow distributions


(d) Hand distributions
Figure 2: Upper body joint distributions are projected into 2D over a range of viewpoints to generate 2D joint position distributions (A limited range of viewpoints are used for illustration to allow for greater clarity). Lighter colours indicate more likely positions.

This large dataset is infeasible to work with, and so the Gaussian mixture model of this distribution is a useful form of dimension reduction. A more detailed description on GMMs and their training is provided in Appendix A. The Gaussian mixture model is denoted by , the probability of an upper body pose occurring,


Learning the GMM can be computationally intensive and a large number of mixture components may be required. This is remedied by assuming independent left and right arms, and training two mixture models instead.

It is unlikely that states will vary much between time steps, and so we use random walk to describe the motion between states: , with , or


The covariance matrix in (8) is assumed to be a diagonal matrix with each diagonal term selected empirically with image dimensions in mind.

The prior learned from the training data inherently contains kinematic constraints, as well as information on more commonly observed poses. It is also extremely compact and simple. Using this prior, and the fact that the product of two multivariate normal densities over random variable

is another multivariate normal and scaling constant, we can write






As a result, the evidence can be computed as


This provides the final transition model


This model can be viewed as a mixture of discrete Ornstein-Uhlenbeck processes, in that states behave as random walk, but drift towards a set of typically observed mean poses.

3.2 Observation model

The observation model used here is assumed to be a Gaussian centred about the difference between a subset of states and measurements,


The pose state contains the image positions of the head, neck, shoulders, elbows and hands, but it is assumed that only the head, neck and hand states can be measured. These measurements correspond to the subset of states used in the measurement model of (15), selected using . The covariance matrix is assumed to be a diagonal matrix with empirically selected diagonal terms, corresponding to a maximum measurement error in pixels, selected with image dimensions in mind.

3.3 Particle filter approximation

An analytical solution to the integral in (1) is not always easily computed and often an approximation is required. One way of performing this is to approximate the target distribution using a discrete set of samples. Let


where the weights are chosen using importance sampling. Consider the full posterior distribution over all states and measurements, with initial estimate ,


For a Markov process, the current measurement is only dependent on the current state and the current state is only dependent on the previous state, so we can write


Constructing an importance density from which state samples are easily sampled provides importance weights


which can be written recursively as


Since we are only interested in the state at time , and desire an approximation to the density , we can discard the state history and the weight update equation becomes


Unfortunately, sequential importance sampling often suffers from degeneracy problems [Doucet et al., 2000], where the weights of most particles become negligible after a few iterations. This is remedied by resampling, which generates a new set of particles by sampling with replacement according to the importance weights. This typically eliminates particles that have small weights and adds emphasis to those with larger importance. Special care needs to be taken as to the selection of the proposal density . Ideally this should be as close to the target density as possible.

The sampling importance resampling (SIR) or bootstrap filter, discussed in detail by Ristic et al. [2004], is frequently used for recursive Bayesian filtering. Here, the importance density is usually chosen to be equal to the transition density,


This reduces the importance weight calculation to


By applying resampling at each time step, the weights become uniform, and the weight update simplifies to


The SIR filtering procedure is described in more detail in Algorithm 1.

     for  to  do
        Draw Equation (14)
         = Equation (15)
     end for
     Normalise weights
     Resample according to
  end loop
Algorithm 1 Sampling importance resampling particle filter

Resampling may be computationally expensive, so in practise it is not desirable to resample on each iteration. Instead, resampling need only occur when the effective number of particles is below a certain threshold, and the particle filter is close to degeneracy. An estimate of the effective number of particles used by a particle filter [Kong et al., 1994] is


Unfortunately, drawing samples from the Gaussian mixture model of (14) is rather computationally intensive. Sampling from this GMM requires

draws from a uniform distribution to select a mixture component according to the model’s mixture weights, and a further

draws from different Gaussians (due to the dependence of (14) on previous states) to select particles. As an alternative solution, we propose that samples be drawn from the far simpler density , which results in the weight update equation of


An even more efficient approximation could neglect the scaling term entirely, although this could potentially introduce evidence bias in the tails of the distribution. In the following section, we will show that ignoring this term is effectively equivalent to modifying the transition model such that a random walk is applied to each mixture component independently, as opposed to the entire distribution.

Particle filter tracking in high dimensions typically relies on good initial particle estimates. In an attempt to remedy this, we start with much larger joint variance along the diagonals of

in (8) and slowly reduce this over a burn-in period, to allow for an initial particle convergence phase. This can be considered a form of simulated annealing, which has been used previously for pose tracking by Deutscher et al. [2000].

3.4 Mixture Kalman filter

The particle filter is a useful approximation when dealing with complex probability distributions, which cannot be analytically integrated. However, the use of a Gaussian mixture model in the transition density and a conjugate Gaussian observation model allows us to Rao-Blackwellise the particle filter by performing integrations optimally using a number of Kalman filters to track mixture components, in a manner similar to that described by

Alspach and Sorenson [1972]. This approach, termed the mixture Kalman filter, has been applied to a number of conditionally linear dynamic models by [Chen and Liu, 2000] and [Doucet et al., 2000].

Our goal is to calculate the posterior distribution, , given a sequence of measurements. Recall that a prior model on human pose, learned from Kinect training data, can be denoted by a weighted summation of Gaussians, with means and variances and respectively,


This distribution can be partitioned if we introduce an indicator variable , which refers to the

-th mixture component in the distribution. Then the prior probability over states can be denoted as




Applying the random walk transition density selected in (8) to each mixture component independently provides the transition density for the body pose conditioned on the indicator variable and previous state,


which can be solved analytically to provide the normal distribution


Assuming only a subset of states, , can be observed in the presence of zero-mean Gaussian measurement noise with covariance provides a measurement model,


Equations (32) and (33) are of the form required for optimal Bayesian filtering using the Kalman filter [Kalman, 1960]. The Kalman filter marginalises out historical states and provides the posterior distribution of a state for a given trajectory of indicator variables, , conditioned on a mixture component. Here, the boldface , with is used to denote the -th trajectory of mixture components, from time steps to . First, , a prediction of the state mean conditioned on a particular sequence of indicator variables up to time is made using the transition model of (32),


assuming no process noise, with




The existing uncertainty in the mixture component is propagated through the linear process model, and uncertainty in the model included, to provide the predicted mixture covariance,


When observations are made, the measurement and covariance residuals are calculated using




These residuals are then used to provide the updated mean and covariance estimates


where is the optimal Kalman gain for a linear system. Finally, the posterior density for the state conditioned on a trajectory of mixture components can then be described by a Gaussian,


Using this information, the probability of an indicator variable trajectory conditioned on the sequence of measurements, , can be used to obtain the target distribution


Here, denotes the number of indicator components in the motion model, and the number of indicator variable trajectories.

The conditional indicator probability is obtained by marginalising the joint state indicator distribution,


The contents of the integral in (44) are known, with the normal measurement model of (33) and the result of the Kalman filter prediction step, also Gaussian, which we shall denote as . As a result, (44) reduces to an iterative form


with a normalising constant.

Unfortunately the sums in (43) are hard to compute, as the number of trajectories grows exponentially with each filtering iteration, so in practice we approximate (43) as a weighted sum of trajectories of interest,


The mixture Kalman filter uses importance sampling to select the subset of trajectories, with weights updated using


when indicator variables are sampled from the proposal density, .

Using the sampled indicator variables and these weights, a maximum a posteriori estimate for the upper body pose can be obtained through a weighted combination of updated mixture means,


This pose estimate is easily calculated, typically requiring only a small number of parallel Kalman filters, so is far more efficient than a bootstrap particle filter approximation. Finally, a 3D human body pose is obtained by evaluating (5) at the estimated state.

In practice, many of the weights, , can become negligible after a few iterations, with only a few Gaussians contributing to the final pose estimate. This is remedied by resampling with replacement whenever the effective number of particles falls too low.

Importance sampling can be expensive, so a suboptimal approximation to (43) could be obtained by selecting a fixed set of trajectories by some other means. A number of mixture reduction schemes Salmond [1990],Blom [1984] have been proposed previously, but many of these can be expensive. For example, trajectories could be selected by performing the update step for each possible mixture component and input trajectory, then discarding trajectories with low indicator weights. This approach is termed the split-track filter Smith and Winter [1978]. We propose that a subset of trajectories be selected by only retaining trajectories where and , which forces continuity between indicator variables and guarantees that every mixture component is fairly represented in the posterior distribution, in effect giving more weight to the prior distribution on human poses. Here, weights are updated using


As mentioned previously, weights can tend to zero for a given mixture component. Resampling in this case is not ideal, as it could become impossible for this mixture to contribute towards the pose estimate regardless of future measurements. This is undesirable as it effectively removes the mean-reverting properties of the process model. This is remedied by adding a small uniform prior, , to the weights on each iteration. The size of controls the speed at which the process model is able to transition between reverting to the different mixture means in the pose prior.

4 Tracking results

In the previous section, we introduced a motion model suitable for upper body tracking using recursive Bayesian estimation and discussed a selection of tracking schemes to perform this. The first, a bootstrap particle filter, makes proposals from the GMM transition model in (14) and uses the weight update equation in (23). This sampling step is quite time consuming, and the second, faster scheme discussed draws samples from the simple random walk in (8) for use with the weight update equation in (26). The third tracker neglects the scaling evidence term in the weight update equation of (26) to obtain an even faster approximation. Neglecting this term is equivalent to assuming independence across mixture components, or that the transition noise is added to each mixture component separately. The final two tracking schemes introduced also use this slightly modified transition model, where noise is added to each mixture component independently, to allow for an iterative solution using the mixture Kalman filter. The importance sampling step used to select indicator variables in this scheme can be time consuming, so an approximation using a deterministic set of indicator trajectories was also proposed, where each indicator variable selected is paired with a specific trajectory.

Results obtained after applying the five tracking schemes discussed to manually annotated image sequences are provided here. Each of the schemes was applied to image sequences with a moving person, and the pose estimates compared to those obtained using the Kinect motion tracker. Independent datasets were used to learn the pose priors and test the pose estimates. Figure 3 shows the mean pixel error for each joint over the test sequence.

Figure 3: Average joint error over an image sequence containing a moving person.

No simulated annealing was used for the scheme sampling from the full Gaussian mixture model, as this required a larger level of noise in the transition model in order to avoid losing track of the joints completely. The figure shows that the best performance was obtained using the mixture Kalman filter (MKF) approaches. Of the particle filter approaches, the sampling scheme with no scaling converged and tracked the actual pose best, with rather poor tracking achieved when weighting was included. The theoretically preferred Gaussian mixture model sampling was unable to adequately track motion, presumably due to its slow convergence.

A commonly used metric that assesses the performance of 2D pose estimation algorithms is the probability of correct pose (PCP) [Yang and Ramanan, 2011], which shows the percentage of correctly localised body parts, where a body part is deemed to be correctly localised if its end points fall within some fraction of the ground truth body part length. Figure 4 shows the PCP curves for each of the various tracking schemes (only forearm and upper arm localisation is considered). This metric highlights the performance of the Mixture Kalman filters.

Figure 4: Probability of correct pose (PCP) curves for the various tracking approaches show that the MKF and simple sampling (without scaling) strategies are generally the best estimators of pose.

Figure 5 indicates the pixel errors obtained for each joint over the entire test period. Noticeable error spikes that occur when the particle filters are used are not present in the mixture Kalman filter results. The superior performance of the mixture Kalman filter approaches and the simple sampling scheme disregarding scaling make it is clear that the modified transition density of (32), where noise is added to each component independently, is a better model of human motion than that of (14).

Figure 5: Tracking error for an image sequence of a moving upper body, when each of the five suggested tracking schemes is applied.

Table 1 shows the average time taken for each filter iteration, when each of the suggested tracking schemes is used. It is clear that sampling from the full GMM is significantly more time consuming than the simple sampling, but that the mixture Kalman filters are far faster than all of the particle filter approximations.

Sampling strategy Time
Simple sampling, with scaling (10000 particles) 0.046 s
Simple sampling, no scaling (10000 particles) 0.028 s
GMM sampling (10000 particles) 2.947 s
MKF (30 mixture components) 0.021 s
MKF, fixed tracks (30 mixture trajectories) 0.015 s
Table 1: Average iteration times

Note that the mixture Kalman filter approximation using deterministically selected tracks provides almost identical performance to the MKF using sampled indicator variables, but is significantly faster. Qualitative results show that using the MKF with fixed tracks provides a much smoother tracking result (see accompanying videos). This deterministic MKF also appears to be better at dealing with uncommon scenarios such as raised arms (Figure 6), which is presumably due to the fact that all mixture components are paired with a specific trajectory, and as a result can always contribute to a pose estimate. In contrast, the original MKF will place emphasis on mixture components that carry more weight, and this effect will propagate until components of less weight become negligible. The MKF distribution obtained when sampling indicators may be closer to the true joint distribution, but appears less suited to providing a point estimate as a result, since it appears to be more susceptible to ambiguities in the pose estimation.

Figure 6: Mean poses and weights learned from the training data show that rare events such as raised and bent arms are considered less important than frequently observed poses by our model. As a result, these poses are less likely to contribute to a pose estimate when importance sampling is used because the corresponding indicator variables are sampled less frequently. Forcing trajectories to contain these indicators by using the proposed deterministic indicator selection means point estimates are more likely to include rare events when they do occur.

5 Automatically obtaining measurements

Thus far, manually annotated images have been used to compare pose estimation schemes. The process of detecting head and hand positions and incorporating these into the filtering framework is now described.

5.1 Face detection and tracking

Automatic face detection is frequently required by computer vision systems and a large number of extremely effective algorithms are available to accomplish this. In this work, an OpenCV [Bradski, 2000] implementation of the well known Viola and Jones [2001]

face detector is applied. This detector classifies faces using a cascade of boosted classifiers, trained using the responses to Haar-like features.

The face detector is trained over a wide selection of faces, but only frontal faces are used as positive training examples, in line with our end application of human-robot interaction. In these applications, a robot should only attempt to engage with a person who is looking directly at it, in the same way humans make eye contact when conversing.

The face detection is augmented through the addition of face tracking using a Kalman filter [Kalman, 1960] and constant velocity motion model, applying a modified version of the simple object tracker described by Burke [2010]. This tracking provides a degree of robustness to false negatives (faces present, but not detected), and can be used to reject false positives (faces detected, but not present) as these tend to be detected sporadically and fail to provide lasting tracks.

With each input image, detected faces are compared to tracked faces using a Euclidean norm distance measure, including the position and size (height and width) of the faces. If this measure falls below a certain threshold, the update stage of the Kalman filter is applied to the corresponding tracked face. If this is not the case, a new track is started. When faces have not been observed for a certain number of time steps, they are removed from the list of tracked faces. Similarly, tracked faces are only used if the track has lasted for a predefined length of time.

5.2 Hand detection and tracking

Once detected, faces contain important information, which can assist in the detection of other body parts. This section shows how the detected face can be used to determine the tracked person’s skin colour, and segment hands.

First, a histogram of the colours (Lab colour space) present in a square image patch bounding the detected face is back-projected to provide a likelihood map of image areas resembling skin. Here, back-projection refers to the process of evaluating the probability of an image pixel being skin coloured, with the likelihood approximated by a histogram of the detected pixel values in a training image patch. An exponentially weighted moving average filter favouring historical measurements is applied to the histogram to limit the effects of spurious lighting dependent observations.

Originally, a Gaussian mixture model was trained using this image patch and used for skin colour segmentation, but this proved computationally expensive, and provided little improvement over a simple back-projection. In order to assist in the recognition of hands, areas of high likelihood are only labelled as left or right hands when placed within an initialisation area, consisting of the left and right halves of the input image. This serves as the hand detection process. The hand likelihood image can contain unwanted static artefacts, due to skin coloured objects or shadows in the image. We can remove these artefacts by applying a background segmentation algorithm [Zivkovic, 2004], which classifies pixels as foreground or background objects using an adaptive per pixel Gaussian mixture model. This segmentation process labels static objects as background by maintaining a history of pixel values over frames, so assumes a static camera. As a result, this assumption may not be ideal for applications in mobile robotics. Fortunately, the background segmentation is not essential and can be removed for mobile applications, with only a slight degradation in qualitative hand detection results.

Immediately after initialisation, a mean-shift tracker [Bradski, 1998] is used to track the detected hands. Mean-shift locates the maxima of a likelihood function, in this case the re-projection likelihood obtained using the detected face, using discrete samples from the distribution. On each iteration, the original hand position is adjusted based on the mean-shift maxima. As hands typically form larger blobs than wrists, the mean-shift tracker tends to remain centred on hands, and does not typically move along skin coloured forearms. When combined with the initialisation process, this allows for relatively robust hand tracking. If hands are lost (the average likelihood in the tracked hand area drops below a predefined threshold), the user simply re-initialises the hand tracker by returning their hands to the original initialisation area.

The mean-shift tracker is unable to track rapidly moving objects particularly well, so is augmented through the use of a constant velocity Kalman filter tracker similar to that used for face tracking, which provides a predicted region of interest in which to search for a hand and improves the mean-shift tracking. Note that the predicted region could have been obtained by using the predicted body position in the mixture Kalman filter framework, but it turns out that the random walk motion model is not a very good predictor of hand positions, since it contains no velocity information. Figure 7 illustrates the detection and tracking process.

(a) Likelihood map
(b) 2D Pose estimate
(c) 3D Pose estimate
Figure 7: An image patch (green square) centred in the detected face (white square) is used to build a likelihood map of skin coloured areas in the image (Figure (a)a). Detected hands are shown using yellow ellipses, with cyan ellipses showing the predicted hand locations. Once detected, the head and hand estimates are used to update the mixture Kalman filter and provide 2D (Figure (b)b) and 3D (Figure (c)c) pose estimates.

Unfortunately, the use of skin colour to detect hands leads to difficulties in discriminating between hands. This is alleviated somewhat by masking the image area predicted to contain the left hand when tracking the right hand, and vice versa with the left, but problems still occur when hands merge, or for clapping motions, where a constant velocity prediction causes hands to swap. Examples of these failures are shown in Figure 8. Micilotta and Bowden [2004] have proposed the use of a GMM trained using prior pose estimates to disambiguate left and right hands, but this simply tends to identify hands as left if they are found to the left side of the head (and vice versa to the right), sometimes incorrectly rejecting instances where hands cross the body.

Figure 8: Similarities in hand appearance occasionally result in merging hands and tracking failures. Figures (a)a to (f)f show a failure due to merging hands, while Figures (g)g to (i)i highlight a failure to track a clapping motion, resulting from the constant velocity motion model. A better hand detector or knowledge of forearm position could be used to remedy this.

5.3 Edge-based hand association

Errors resulting from incorrect hand association could be avoided by taking the orientation of the arms into account in the hand detection process. Unfortunately, it is quite difficult to detect arms, which can have highly variable appearances in images. However, once hands have been detected, we can assess the validity of a pose estimate using additional image features and use this to correct hand association errors. The MKF pose estimate contains the 2D position of each joint, and can be used to form a stick model similar to that drawn in Figure 8, with limbs described by a set of oriented edges. As a result, we propose that a natural measure of a pose estimate’s likelihood is one that uses orientation information from edges detected in the image.

Initially, an edge-based image representation is obtained using the Canny edge detector [Canny, 1986]. The probabilistic Hough line detector [Matas et al., 2000] is then used to detect linear edge segments. The number of edge segments providing support for a pose estimate or limb position is then used to decide if the correct hand association has been made, or if the pose estimate is in error. We use a Gaussian kernel to determine edge support, with edges considered as evidence for a given limb if the likelihood


is greater than some threshold . Here, is a vector of the edge orientation and the image position of a detected edge midpoint, while contains the position and orientation of the estimated limb. is a diagonal covariance matrix, with variances selected empirically to allow feasible position and angle offsets. Figure 9 illustrates the voting process for a given pose estimate.

(a) Correct pose estimate
(b) Incorrect pose estimate
Figure 9: Detected image edges are used to determine the validity of a pose estimate. Edges providing no evidence are blue, while supporting edges are drawn in the same colour as the supported limb. In the case of valid pose estimates (Figure (a)a), a number of edges with similar position and orientation to the estimated limb position tend to be observed. This typically fails to occur when a hand association error has occurred or if a pose estimate is incorrect (Figure (b)b).

The proposed heuristic allows for data association errors in hand measurement to be corrected relatively quickly, but does not prevent these errors from occurring in the first place. Direct measurement of limb positions should eliminate hand association errors of this type completely.

6 Combined detection and tracking results

Results obtained when the head and hand detectors of Section 5 are used in conjunction with the mixture Kalman filter (deterministically selected tracks) are provided here. Figure 10 shows the mean error for each joint over a test sequence of more than 1000 images, when estimated 3D positions were compared with the skeleton output of a Kinect sensor, by aligning the head, neck and shoulders using fixed scale Procrustes analysis [Schönemann, 1966]. This comparison is not ideal, as the Kinect is not perfectly accurate and often fails when hands cross over the body, but it does provide an indication that the 3D pose estimate is plausible.

Figure 10: Average 3D joint error over an image sequence containing a moving person.

Figure 11 shows the position errors obtained for each joint over the entire test period, when compared with the positions obtained with the Kinect sensor. For much of the time, the error in hand position remains below 20 cm, with error spikes only occurring when the hands crossed the body or moved rapidly. A particularly encouraging result is that the average elbow errors remained quite low, even though no measurements of these joints were made at all. Qualitative results can be seen in the accompanying video, which also shows the edge-based error correction in operation.

Figure 11: 3D Tracking error for an image sequence of a moving upper body.

A noticeable source of error involved uncommon poses that were not present in the prior training data. This should be remedied by additional training, but potentially at the expense of pose estimation accuracy in other poses.

Figure 12: 2D pose errors over time obtained by applying the pose estimation of Eichner et al. [2012] and our MKF approach, with and without background subtraction.

Figure 12 shows the 2D tracking errors obtained by applying the Eichner et al. [2012] 2D pose estimation approach (the current state of the art in 2D pose estimation in unconstrained images) and our technique to a sequence of over 500 images, using Kinect joint tracks as ground truth. This sequence is particularly challenging for our approach as the study participant is wearing short sleeves, which could potentially result in hand detection failures due to skin coloured arm regions, and increases the risk of incorrect hand association. The latter occurs towards the end of the tracking sequence, resulting in large hand and elbow tracking errors that failed to be corrected immediately by the edge-based pose correction. The figure also shows the results of the MKF pose estimation when background subtraction is not applied, which are very similar to those obtained when this is included. In fact, additional noise in the likelihood map used for hand detection prevented the hand association failure that occurred when background subtraction was applied, resulting in overall improved performance, although a number of spurious pose estimates were observed instead.

It should be noted that the Eichner et al. [2012] approach is at a disadvantage here as it does not incorporate temporal information, but it does perform far more processing, operating at approximately 0.5 frames a second. Our approach operates at just under 30 frames per second, with face detection the primary bottle neck.

In practise, the Eichner et al. [2012] pose estimation performed well at upper arm detections, but typically failed at forearm detections, presumably due to the cluttered background used for experimentation. Figure 13 shows the PCP curves comparing the 2D pose estimation accuracy. Our approach shows significant improvement in detection rates for greater detection thresholds, while providing similar performance to Eichner et al. [2012] over smaller thresholds. Once more, improved performance was seen when background subtraction was not applied, resulting from the absence of incorrect hand association in this test set, but a performance reduction is expected in cases where a number of skin-coloured objects are present in the image background. Qualitative results can be seen in the accompanying video, which provides a comparison with Kinect pose estimation and that of Eichner et al. [2012].

Figure 13: Probability of correct pose (PCP) curves for the MKF (with and without background subtraction) and Eichner et al. [2012] approaches show significantly better performance when our approaches are used.

7 Conclusions and future work

This paper has provided results on upper body pose tracking using Kinect joint priors and simple hand and head measurements. Four tracking schemes have been considered and a mixture Kalman filter shown to provide effective upper body pose estimation. The use of the proposed upper body model allows reliable pose estimates to be obtained indirectly for a number of joints that are often difficult to detect using traditional object recognition strategies. The suggested model is designed with computational efficiency and analytical tractability in mind, yet still incorporates bio-mechanical properties of the upper body, typically only included using more complex body models.

Comparisons with the current state of the art in 2D pose estimation [Eichner et al., 2012] have shown that our approach outperforms this significantly, both in terms of estimation performance and time complexity. Good 3D tracking results were also exhibited during experimentation.

A mechanism for correcting hand data association errors has been provided, but these errors will continue to occur without the inclusion of additional joint measurements. Improved hand association is required if multiple humans are to be tracked at once. While good results have been obtained for a constrained set of camera viewpoints, additional priors and improved measurements may be required to resolve pose ambiguities if 3D position is required over a larger range of viewpoints. Future work will involve the inclusion of a better mechanism for detecting hands and evaluating the effects of including additional training data collected from multiple persons.


  • Alspach and Sorenson [1972] D. Alspach and H. Sorenson. Nonlinear Bayesian estimation using Gaussian sum approximations. Automatic Control, IEEE Transactions on, 17(4):439–448, 1972.
  • Bishop [2006] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
  • Blom [1984] H. A. P. Blom. An efficient filter for abruptly changing systems. In Decision and Control, 1984. The 23rd IEEE Conference on, volume 23, pages 656–658, Dec 1984. doi: 10.1109/CDC.1984.272089.
  • Bradski [2000] G. Bradski. The OpenCV library. Dr. Dobb’s Journal of Software Tools, 2000.
  • Bradski [1998] G. R. Bradski. Real time face and object tracking as a component of a perceptual user interface. In Applications of Computer Vision, 1998. WACV ’98. Proceedings., Fourth IEEE Workshop on, pages 214–219, 1998.
  • Buehler et al. [2009] P. Buehler, A. Zisserman, and M. Everingham. Learning sign language by watching TV (using weakly aligned subtitles). In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 2961–2968, June 2009. doi: 10.1109/CVPR.2009.5206523.
  • Burke [2010] M. Burke. Laser-based target tracking using principal component descriptors. In 21st Annual Symposium of the Pattern Recognition Association of South Africa (PRASA), 2010.
  • Burke and Lasenby [2014] M. Burke and J. Lasenby. Fast upper body joint tracking using kinect pose priors. In F. Perales and J. Santos-Victor, editors, AMDO 2014, volume 8563 of Lecture Notes in Computer Science, pages 94–105. Springer Berlin / Heidelberg, 2014.
  • Canny [1986] J. Canny. A computational approach to edge detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PAMI-8(6):679–698, 1986.
  • Charles et al. [2013] J. Charles, T. Pfister, M. Everingham, and A. Zisserman. Automatic and efficient human pose estimation for sign language videos. International Journal of Computer Vision, 2013.
  • Chen and Liu [2000] R. Chen and J. S. Liu. Mixture Kalman filters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(3):493–508, 2000.
  • Davison et al. [2001] A. J. Davison, J. Deutscher, and I. D. Reid. Markerless motion capture of complex full-body movement for character animation. In Proceedings of the Eurographic workshop on Computer animation and simulation, pages 3–14, New York, NY, USA, 2001. Springer-Verlag New York, Inc.
  • Deutscher et al. [2000] J. Deutscher, A. Blake, and I. Reid. Articulated body motion capture by annealed particle filtering. In Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, volume 2, pages 126–133, 2000.
  • Doucet et al. [2000] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.
  • Eichner et al. [2012] M. Eichner, M. Marin-Jimenez, A. Zisserman, and V. Ferrari. 2D articulated human pose estimation and retrieval in (almost) unconstrained still images. International Journal of Computer Vision, 99:190–214, 2012.
  • Gavrila and Davis [1996] D. M. Gavrila and L. S. Davis. Tracking of humans in action: a 3-D model-based approach. In In Proc. ARPA Image Understanding Workshop, pages 737–746, 1996.
  • Germann et al. [2011] M. Germann, T. Popa, R. Ziegler, R. Keiser, and M. H. Gross. Space-time body pose estimation in uncontrolled environments. In 3DIMPVT’11, pages 244–251, 2011.
  • Howe et al. [1999] N. R. Howe, M. E. Leventon, and W. T. Freeman. Bayesian reconstruction of 3D human motion from single-camera video. In Advances in Neural Information Processing Systems, pages 820–826. MIT Press, 1999.
  • Hua et al. [2005] G. Hua, M. Yang, and Y. Wu. Learning to estimate human pose with data driven belief propagation. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 747–754, June 2005.
  • Jauregui et al. [2010] D. A. G. Jauregui, P. Horain, M. K. Rajagopal, and S. S. K. Karri. Real-time particle filtering with heuristics for 3D motion capture by monocular vision. In Multimedia Signal Processing (MMSP), 2010 IEEE International Workshop on, pages 139–144, Oct. 2010.
  • Kalman [1960] R. E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering, 82(Series D):35–45, 1960.
  • Kong et al. [1994] A. Kong, J. S. Liu, and W. H. Wong.

    Sequential imputations and Bayesian missing data problems.

    Journal of the American Statistical Association, 89(425):278–288, 1994.
  • Lee and Cohen [2004] M. W. Lee and I. Cohen. Human upper body pose estimation in static images. In T. Pajdla and J. Matas, editors, Computer Vision - ECCV 2004, volume 3022 of Lecture Notes in Computer Science, pages 126–138. Springer Berlin Heidelberg, 2004.
  • Lee [2006] S. Lee. Automatic gesture recognition for intelligent human-robot interaction. In Automatic Face and Gesture Recognition, 2006. FGR 2006. 7th International Conference on, pages 645–650, April 2006.
  • Matas et al. [2000] J. Matas, C. Galambos, and J. Kittler. Robust detection of lines using the progressive probabilistic hough transform. Computer Vision and Image Understanding, 78(1):119 – 137, 2000.
  • Micilotta and Bowden [2004] A. S. Micilotta and R. Bowden. View-based location and tracking of body parts for visual interaction. In BMVC, pages 1–10, 2004.
  • Nickel and Stiefelhagen [2007] K. Nickel and R. Stiefelhagen. Visual recognition of pointing gestures for human-robot interaction. Image and Vision Computing, 25(12):1875–1884, 2007. The age of human computer interaction.
  • Ristic et al. [2004] B. Ristic, S. Arulampalam, and N. Gordon. Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House, 2004.
  • Rother et al. [2004] C. Rother, V. Kolmogorov, and A. Blake. “GrabCut”: interactive foreground extraction using iterated graph cuts. In ACM SIGGRAPH 2004 Papers, SIGGRAPH ’04, pages 309–314, New York, NY, USA, 2004. ACM.
  • Salmond [1990] D. J. Salmond. Mixture reduction algorithms for target tracking in clutter. volume 1305, pages 434–445, 1990. doi: 10.1117/12.21610. URL
  • Schönemann [1966] P. H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10, 1966.
  • Shotton et al. [2011] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Computer Vision and Pattern Recognition, June 2011.
  • Sminchisescu and Triggs [2001] C. Sminchisescu and B. Triggs. Covariance-scaled sampling for monocular 3D body tracking. In IEEE International Conference on Computer Vision and Pattern Recognition, volume 1, pages 447–454, Hawaii, 2001.
  • Sminchisescu and Triggs [2003] C. Sminchisescu and B. Triggs. Kinematic jump processes for monocular 3D human tracking. In Proceedings of the 2003 IEEE computer society conference on Computer vision and pattern recognition, CVPR’03, pages 69–76, Washington, DC, USA, 2003. IEEE Computer Society.
  • Smith and Winter [1978] M. C. Smith and E. M. Winter. On the detection of target trajectories in a multi target environment. In Decision and Control including the 17th Symposium on Adaptive Processes, 1978 IEEE Conference on, volume 17, pages 1189–1194, Jan 1978.
  • Triesch and Von Der Malsburg [1998] J. Triesch and C. Von Der Malsburg. A gesture interface for human-robot-interaction. In Automatic Face and Gesture Recognition, 1998. Proceedings. Third IEEE International Conference on, pages 546–551, April 1998.
  • Viola and Jones [2001] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–511–I–518, 2001.
  • Wu and Huang [1999] Y. Wu and T. Huang. Vision-based gesture recognition: A review. In A. Braffort, R. Gherbi, S. Gibet, D. Teil, and J. Richardson, editors, Gesture-Based Communication in Human-Computer Interaction, volume 1739 of Lecture Notes in Computer Science, pages 103–115. Springer Berlin / Heidelberg, 1999.
  • Yang and Ramanan [2011] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’11, pages 1385–1392, Washington, DC, USA, 2011. IEEE Computer Society.
  • Yu et al. [2013] T.-H. Yu, T.-K. Kim, and R. Cipolla. Unconstrained monocular 3D human pose estimation by action detection and cross-modality regression forest. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’13, pages 3642–3649, Washington, DC, USA, 2013. IEEE Computer Society.
  • Zivkovic [2004] Z. Zivkovic. Improved adaptive Gaussian mixture model for background subtraction. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 2, pages 28–31 Vol.2, 2004.

Appendix A Gaussian mixture models

Many tracking applications require a suitable probabilistic model of prior and likelihood distributions. Gaussian mixture models (GMMs) are a popular choice of model for probability distributions due to their ability to approximate a wide variety of complex distributions with a limited number of parameters. Only a brief overview of GMMs is provided here, but readers are referred to Bishop [2006] for additional information. These models are particularly useful in acquiring an analytical approximation to a probability distribution when only discrete samples from the distribution are available. Formally, a Gaussian mixture model is defined as




with parameters , and . denotes the length of the state vector . is symmetric and positive definite.

Training a GMM using discrete data is accomplished through expectation maximisation. Expectation maximisation is an iterative two step process obtaining the maximum likelihood estimation of parameters in a model. Assuming observations, start with an initial, random estimate of the model parameters and calculate the responsibility that the -th Gaussian takes for explaining an observation ,


This is termed the expectation step. The maximisation stage occurs by applying analytic estimators to maximise the likelihood of the data. Parameter is calculated as


and as


The effective number of points assigned to the -th Gaussian in the mixture model is calculated as