Estimation of Human Body Shape and Posture Under Clothing

12/17/2013 ∙ by Stefanie Wuhrer, et al. ∙ National Research Council Canada ∙ Max Planck Society ∙ Universität Saarland ∙ uOttawa ∙ Fraunhofer

Estimating the body shape and posture of a dressed human subject in motion represented as a sequence of (possibly incomplete) 3D meshes is important for virtual change rooms and security. To solve this problem, statistical shape spaces encoding human body shape and posture variations are commonly used to constrain the search space for the shape estimate. In this work, we propose a novel method that uses a posture-invariant shape space to model body shape variation combined with a skeleton-based deformation to model posture variation. Our method can estimate the body shape and posture of both static scans and motion sequences of dressed human body scans. In case of motion sequences, our method takes advantage of motion cues to solve for a single body shape estimate along with a sequence of posture estimates. We apply our approach to both static scans and motion sequences and demonstrate that using our method, higher fitting accuracy is achieved than when using a variant of the popular SCAPE model as statistical model.




1 Introduction

The problem of estimating the body shape and posture of a dressed human subject is important for various applications, such as virtual change rooms and security. For instance, in virtual change rooms, a dressed user steps in front of a virtual mirror and the system aims to simulate different types of clothing for this user. To this end, such a system requires an accurate estimate of the body shape and posture of the user.

We present an algorithm to estimate the human body shape and posture under clothing from single or multiple 3D input frames that are corrupted by noise and missing data. Our approach assumes that the clothing fits to the body and may fail for loose clothing, such as skirts or wide dresses. When multiple 3D frames of the same human subject are recorded in different postures, these observations provide important cues about the body shape of the subject. The clothing may be more or less loosely draped around a particular body part in different postures, which allows for improved shape estimates based on postures where the clothing is close to the body shape. To utilize these cues, we model body shape independently of body posture, and optimize a single representation of the body shape of the subject along with one pose estimate per frame to fit to a set of input frames. When multiple 3D frames of a subject in motion are recorded with high frame rates, our algorithm takes advantage of the temporal consistency of the acquired data. To reduce the complexity of the problem, our method does not explicitly simulate the clothing, but learns information about likely body shapes using machine learning.

Current solutions to this problem use the SCAPE model [2] to represent the body shape and posture of a human subject. This model represents the body shape in a statistical shape space learned from body scans of multiple subjects acquired in a standard posture and combines this with a representation of body posture learned from body scans of a single subject in multiple postures. A popular variant of SCAPE that performs well in practice is the method by Jain et al. [18] that learns variations in body posture using a skeleton-based deformation. The main disadvantage of these methods is that even when acquiring multiple subjects in standard posture, the postures differ slightly, which leads to a statistical space for body shape that represents a combination of shape and posture changes. Hence, for SCAPE and its variants, shape and posture representations are not properly separated.

To remedy this problem, we propose a method that uses a posture-invariant statistical shape space to model body shape combined with a skeleton-based deformation to model body posture. Using a posture-invariant statistical shape space for body shape offers the additional advantage that the shape space can be learned from body scans of multiple subjects acquired in multiple postures, which makes it possible to leverage more of the available training data.

This work makes the following main contributions:

  • We present a representation that models human body shape and posture independently. Human body shape is represented by a point in a posture-invariant shape space found using machine learning, and human body posture is represented using skeletal joint angles.

  • We present an algorithm to estimate body shape and posture under clothing that fits closely to the body from single or multiple 3D input frames. For multiple input frames, a single representation of body shape is optimized along with a posture estimate per frame to fit to the input frames. This makes it possible to exploit important cues about body shape from multiple frames.

  • When multiple 3D frames of a subject in motion are recorded with high frame rates, the presented fitting approach is stable as temporal consistency is used for tracking.

  • We show experimentally that our method achieves higher fitting accuracy than the state-of-the-art variant of SCAPE by Jain et al. [18].

2 Related Work

The problem of estimating the body shape and posture of humans occurs in many applications and has been researched extensively in computer vision and computer graphics. Many methods focus on estimating the posture of a subject in an image or a 3D scan without aiming to predict the body shape (e.g. [16, 4, 27]). Other methods aim to track a given human shape that may include detailed clothing across a sequence of images or 3D scans in order to capture the acquired motion without using markers (e.g. [8, 10, 29, 9, 12]).

In this work, we are interested in estimating both the body shape and posture of any human subject represented as a 3D mesh that was acquired while wearing clothing. To achieve this goal, we need a model that can represent different body shapes in different postures. Statistical shape models have been shown to be a suitable representation in this case.

Statistical shape models learn a probability distribution from a database of 3D shapes. To perform statistics on the shapes, the shapes need to be in full correspondence. Allen et al. [1] proposed a method to compute correspondences between human bodies in a standard posture and to learn a shape model using principal component analysis (PCA). This technique has the drawback that small variations in posture are not separated from shape variations. To remedy this, multiple follow-up methods have been proposed. Hasler et al. [14] analyze body shape and posture jointly by performing PCA on a rotation-invariant encoding of the model’s triangles. While this method models different postures, it cannot directly be constrained to have a constant body shape and different poses for the same subject captured in multiple postures. With the goal of analyzing body shape independently of posture, Wuhrer et al. [32] propose to perform PCA on a shape representation based on localized Laplace coordinates of the mesh. In this work, we combine this shape space with a skeleton-based deformation model that allows the body posture to be varied.

Several methods have been proposed to decorrelate the variations due to body shape and posture changes, which allow to vary body shape and posture independently. The most popular of these models is the SCAPE model [2], which combines a body shape model computed by performing PCA on a population of 3D models captured in a standard posture with a posture model computed by analyzing near-rigid body parts (corresponding to bones) of a single body shape in multiple postures. Chen et al. [7] recently proposed to improve this model by adding multi-linear shape models for each part of the SCAPE model, thereby enabling more realistic deformation behaviour near joints of the body. Neophytou and Hilton [23] proposed an alternative statistical model that consists of a shape space learned as PCA space on normalized postures and a pose space that is learned from different subjects in different postures.

Several authors have proposed to use statistical shape models to estimate human body shape and posture under clothing. Most of these methods use the SCAPE model as statistical model. Muendermann et al. [22] proposed a method to track human motion captured using a set of synchronized video streams. The approach samples the human body shape space learned using SCAPE and initializes the body shape of the subject in the video to its closest sample in terms of height and volume. The approach then tracks the pose of the subject using an iterative closest point method, where joints are modeled as soft constraints. Balan and Black [5] used the SCAPE model to estimate the body shape and posture of a dressed subject from a set of input images. The method proceeds by optimizing the shape and posture parameters of the SCAPE model to find a human body that optimally projects to the observed silhouettes. If the same subject is given in multiple poses, the shape of the subject is assumed to be constant across all poses, and the model optimizes one set of shape parameters and several sets of posture parameters to fit the model to the observed input images. Weiss et al. [30] used a similar technique to fit a SCAPE model to a Kinect scan. Zhou et al. [33] used a SCAPE model to modify an input image. They learned a correlation between the SCAPE model parameters and semantic parameters, such as the body weight, which allows them to modify an instance of the SCAPE model to appear to have higher or lower body weight. The approach first optimizes a learned SCAPE model to fit to the input image, changes the shape of the 3D reconstruction of the subject, and modifies the input image, such that the silhouette of the modified subject is close to the projection of the changed 3D shape. Jain et al. [18] extended this approach to allow for the modification of video sequences. They used a slightly modified version of the SCAPE model that does not learn a subject-specific pose deformation of the triangles. Helten et al. [15] proposed a real-time full body tracker based on the Kinect. They first acquire the shape of a subject in a fixed posture using a Kinect, and then track the posture of the subject over time using the modified SCAPE model by Jain et al. [18] while fixing the shape parameters.

A notable exception to using the SCAPE model is the approach by Hasler et al. [13], which uses a rotation-invariant shape space [14] to estimate body shapes under clothing. Recently, Perbet et al. [25] proposed an approach based on localized manifold learning that was shown to lead to accurate body shape estimates. While these methods have been shown to perform well on static scans, they are less suitable to predict body shape and postures from motion sequences as the body shape cannot be controlled independently of posture in these shape spaces.

In this work, we are interested in fitting a single body shape estimate and multiple body posture estimates to a given sequence of scans, which requires a shape space that models variations of body shape and posture independently. The variant of the SCAPE model proposed by Jain et al. [18] is a commonly used state-of-the-art method that has been shown to lead to accurate body shape and posture estimates and that models shape and posture variations independently. We propose a new shape space that combines a posture-invariant statistical shape model with a skeleton-based deformation, and show that this model can fit more accurately to 3D input meshes than this popular variant of the SCAPE model.

3 Overview

We aim to estimate the body shape and postures of a dressed human in motion given as a set of input frames represented as 3D point clouds. To solve this problem, our approach proceeds in two main steps.


We learn a statistical model based on a database of $n$ input scans denoted by $S_1, \dots, S_n$. To perform statistics on this database, all models of the database need to be in full point-to-point correspondence. While in general, computing correspondences between 3D models is a challenging problem [28], template fitting approaches can be used in case of human models [1, 14, 31]. In this work, we use the registered publicly available MPI human shape database [14] (which contains a total of 520 models of over 100 subjects in up to 35 different postures) as training data. We learn two types of variations from the registered database. The first type of variation is information about a small set of landmark positions placed on the models, which helps in automatically detecting the corresponding landmarks on frames of a given motion sequence. These detected landmarks are then used to guide our model fitting. The second type of variation is a body shape model that captures body shape variations across different subjects in a posture-invariant way. This model has the advantage of capturing localized shape variations at the cost that it cannot be described using a small number of global linear mappings (such as SCAPE, for instance).


We fit the learned statistical models to a given motion sequence $F_1, \dots, F_m$. As the shape model cannot be described using a global linear mapping, we cannot directly fit this model to the data efficiently. To remedy this, the fitting procedure uses a rigged template $T$ with manually annotated landmarks and consists of four steps. First, we automatically predict landmark positions on the input frames $F_k$ based on the learned space of landmark positions and given landmarks on the first frame $F_1$. Second, these landmark positions are used to consecutively fit the posture of $T$ to the postures of $F_k$ using a variational approach. Third, the shape of the template model $T$ is fitted to the input frames $F_k$ using a variational approach that allows the shape of $T$ to fit to details of clothing. After this fitting step, we have a sequence of deformed template shapes $T_k$ that fit closely to the input frames $F_k$. Note that $T_k$ may not represent realistic body shapes, as the shapes may include geometric detail from the clothing. To remedy this, we restrict the shapes of $T_k$ in a fourth step to a single point in the learned posture-invariant body shape space.

4 Training a Posture-Invariant Statistical Model

This section outlines how to learn a statistical model based on a database of registered input scans $S_1, \dots, S_n$. Figure 1 gives a visual overview of the two types of shape variations that are learned.

Figure 1: Overview of training required by our method.

4.1 Landmark Model

We use a Markov network to learn relative locations and local surface properties of the 14 anthropometric landmarks shown as red points on the body shapes on the top left of Figure 1. We follow the approach of Wuhrer et al. [31], which uses the network structure shown on the bottom left of Figure 1, where each red point represents a landmark, which is modeled as a node of the Markov network, and each black edge represents a connection between two landmark points, which is modeled as an edge of the Markov network. The approach uses a training database to learn the following node and edge potentials.

Node Potential

The approach learns a surface descriptor $a_r(l_j)$ for each landmark $l_j$ of input scan $S_i$ as the area of the geodesic neighborhood of radius $r$ centered at $l_j$ divided by the area of a planar disk of radius $r$. Note that $a_r(l_j)$ is invariant under isometric deformations, which are deformations that do not cause geometric stretching. Since the surface of a human body in different postures exhibits only limited and localized stretch, we can expect the descriptor to be approximately posture-invariant. To learn localized surface properties around landmark $l_j$, $a_r(l_j)$ is computed for 20 radii $r$ between a fixed minimum and maximum value over all input models $S_i$, and a multivariate Gaussian distribution is fitted to these descriptors. This multivariate Gaussian distribution is used as node potential for $l_j$ in the Markov network.
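The behaviour of the area-ratio descriptor can be illustrated on two surfaces with closed-form geodesic disks. The following is a minimal sketch, not the paper's implementation: the sphere radius `R`, the 20 radii, and the function names are hypothetical values chosen for illustration. On a plane (or any surface isometric to it) the ratio is exactly 1, while on a sphere it drops below 1 as the radius grows.

```python
import math

def area_ratio_plane(r):
    """Geodesic disk area on a flat surface divided by the planar disk area."""
    return (math.pi * r * r) / (math.pi * r * r)

def area_ratio_sphere(r, R):
    """Geodesic disk (spherical cap) area on a sphere of radius R,
    divided by the area of a planar disk of the same radius r."""
    cap_area = 2.0 * math.pi * R * R * (1.0 - math.cos(r / R))
    return cap_area / (math.pi * r * r)

# 20 hypothetical radii; the values used in the paper are not known here.
radii = [0.005 * k for k in range(1, 21)]

# Descriptor vector at a point of a sphere with (hypothetical) radius 0.1.
descriptor = [area_ratio_sphere(r, R=0.1) for r in radii]
```

Stacking such 20-dimensional vectors over all training scans gives the samples to which a multivariate Gaussian would be fitted.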

Edge Potential

The approach learns information about the spatial relationships between landmarks modeled as edge potentials. To learn this information, we first need to spatially align the training models $S_i$. However, it is difficult to spatially align models of human subjects due to the large posture variation. Hence, we compute an isometry-invariant canonical form [11] of each of the models in the database. The canonical forms of all the models have a similar posture and can be spatially aligned using a rigid transformation computed using the known landmark positions. We can then learn the locations and relative positions of the landmarks in the space of canonical forms. We use this information to compute the edge potentials of the Markov network by computing the lengths and directions of each edge over all aligned models, and by fitting a multivariate Gaussian distribution to this data.

Since all of the information contributing to the Markov network is isometry-invariant, this approach learns posture-invariant information about the landmark locations, which enables us to predict landmarks in arbitrary postures.

4.2 Shape Model

To represent human body shape, we learn a posture-invariant statistical shape model based on localized Laplace coordinates, as proposed by Wuhrer et al. [32]. This model, which we summarize in the following, is learned by performing PCA of a population of human shapes in arbitrary postures using a posture-invariant shape representation, and visualized on the right of Figure 1.

This shape representation stores for each vertex $p_i$ of a shape $S$ the Laplace offset in a local coordinate system. That is, we find a posture-invariant representation of $S$ by computing the combinatorial Laplace matrix $L$ of $S$. With the Laplace matrix, we can compute the Laplace offsets $\delta_i$ as

$$[\delta_1, \dots, \delta_{n_v}]^\top = L \, [p_1, \dots, p_{n_v}]^\top \tag{1}$$

where $p_1, \dots, p_{n_v}$ denote the vertices of $S$. These offsets are not posture-invariant. Hence, we express each offset with respect to the following local coordinate system. At each vertex $p_i$, we pick an arbitrary but fixed neighbor $q_i$ as the first neighbor (we choose the same first neighbor for all of the parameterized meshes). We then compute a local orthonormal coordinate system at $p_i$ using the normal vector at $p_i$, the normalized projection of the difference vector $q_i - p_i$ to the tangent plane of $p_i$, and the cross product of the previous two vectors. We denote the three vectors defining the local orthonormal coordinate system by $f_1(p_i)$, $f_2(p_i)$, and $f_3(p_i)$. Since the local coordinate system is orthonormal, we can express $\delta_i$ in this coordinate system as

$$\delta_i = \omega_1(p_i) f_1(p_i) + \omega_2(p_i) f_2(p_i) + \omega_3(p_i) f_3(p_i). \tag{2}$$

The local coordinates $\omega_1(p_i)$, $\omega_2(p_i)$, $\omega_3(p_i)$ are designed to be invariant with respect to rigid transformations of the one-ring neighborhood of $p_i$. To account for global scaling of the shape, we also store a coefficient $s$ related to the scale of the shape. More specifically, $s$ is computed as the average geodesic distance between any two vertices on $S$ computed using the fast marching technique [19].
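The invariance of the local coordinates under rigid motion can be checked numerically on a single vertex and its one-ring. The sketch below makes several assumptions that are not taken from the paper: the Laplace offset uses uniform weights (vertex minus the centroid of its neighbors), the vertex normal is estimated from cross products of consecutive one-ring edges, and all function names and sample coordinates are hypothetical.

```python
import math

def sub(a, b): return [x - y for x, y in zip(a, b)]
def add(a, b): return [x + y for x, y in zip(a, b)]
def scale(a, s): return [x * s for x in a]
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]
def normalize(a): return scale(a, 1.0 / math.sqrt(dot(a, a)))

def local_laplace_coords(p, ring):
    """Coordinates of the Laplace offset of p in a local orthonormal frame.
    ring is the ordered one-ring of p; ring[0] is the fixed first neighbor."""
    centroid = scale([sum(c) for c in zip(*ring)], 1.0 / len(ring))
    delta = sub(p, centroid)  # uniform-weight Laplace offset (assumption)
    normal = [0.0, 0.0, 0.0]
    for q, q_next in zip(ring, ring[1:] + ring[:1]):
        normal = add(normal, cross(sub(q, p), sub(q_next, p)))
    f1 = normalize(normal)                          # normal direction
    d = sub(ring[0], p)
    f2 = normalize(sub(d, scale(f1, dot(d, f1))))   # projected first edge
    f3 = cross(f1, f2)                              # completes the frame
    return [dot(delta, f) for f in (f1, f2, f3)]

def rigid_motion(q, angle, shift):
    """Rotate about the z axis, then about the x axis, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = c * q[0] - s * q[1], s * q[0] + c * q[1], q[2]
    y, z = c * y - s * z, s * y + c * z
    return [x + shift[0], y + shift[1], z + shift[2]]

p = [0.0, 0.0, 0.3]
ring = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.1], [-1.0, 0.0, 0.0], [0.0, -1.0, -0.1]]
before = local_laplace_coords(p, ring)

def move(q):
    return rigid_motion(q, 0.7, [2.0, -1.0, 0.5])

after = local_laplace_coords(move(p), [move(q) for q in ring])
```

Here `before` and `after` agree up to floating-point error, which is the property that makes the representation usable across postures when the one-ring moves approximately rigidly.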

We then perform statistical shape analysis by performing PCA on the vectors of local coordinates $\omega_1(p_i), \omega_2(p_i), \omega_3(p_i)$ of all vertices together with the scale $s$, over all shapes. Let $\mathcal{M}$ denote the learned posture-invariant shape space. To avoid problems related to over-fitting a statistical model, in this work, we keep only a portion of the shape variability present in the training set.
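Truncating a PCA space to a target share of the variability is commonly done by accumulating eigenvalues of the covariance matrix. The following is a minimal sketch of that selection rule; the function name, the eigenvalues, and the fractions are hypothetical, and the fraction actually used in this work is not assumed here.

```python
def components_for_variance(eigenvalues, fraction):
    """Smallest number of principal components whose eigenvalues sum to
    at least `fraction` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= fraction * total:
            return count
    return len(ordered)

# Hypothetical eigenvalues of a shape covariance matrix.
eigenvalues = [5.0, 3.0, 1.0, 1.0]
k_low = components_for_variance(eigenvalues, 0.75)
k_high = components_for_variance(eigenvalues, 0.95)
```

Keeping fewer components discards low-variance directions, which are the ones most likely to encode noise in the training scans.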

5 Estimating Body Shape and Posture from Motion Sequences of Dressed Subjects

This section describes our proposed approach to estimate the body shape and posture of a sequence of input meshes $F_1, \dots, F_m$ showing a dressed human in motion. Ideally, we would like to fit the learned shape model to the data directly. However, this is not efficient because the posture-invariant shape model cannot be described using a small number of global linear transformations. Hence, we use a fitting procedure consisting of four steps. Figure 2 gives a visual overview of the four steps of the approach. First, we use the learned Markov network to predict the locations of the 14 landmarks $l_j$. Specifically, we require the user to provide the locations of $l_j$ for $F_1$, and then predict $l_j$ on the remaining input frames automatically. The advantage of user-specified landmarks on the first frame is that the landmark tracking starts with a good initialization. Second, we use the landmark locations to fit the posture of the rigged template $T$ to the frames $F_k$. This deforms the skeleton of $T$ using a piecewise rigid transformation that is blended onto the surface of $T$. That is, $T$ is deformed using an approximately piecewise rigid transformation in this step. Third, we fit the body shape of $T$ to the observed data using a non-rigid deformation model, which allows $T$ to deform closely to $F_k$. Let $T_k$ denote the deformation of $T$ that was fitted to $F_k$. Once the posture and shape of $T$ have been fitted to each of the input frames $F_k$, the resulting shapes $T_k$ may not represent realistic human body shapes because parts of $T_k$ may be close to data acquired from clothing. Fourth, to find a single realistic body shape estimate in multiple postures, we restrict the shapes of $T_k$ to a single point in the learned posture-invariant shape space. Figure 2 shows results for each of the four steps for two input frames.

Figure 2: Overview of fitting procedure. Blue boxes show the input to our method and green boxes show the results.

5.1 Landmark Prediction

We now outline how the landmark locations are predicted using probabilistic inference on the Markov network with learned potentials that is described in Section 4.1. Given an input mesh $F_k$, we need a set of possible labels, which represent possible locations for the landmarks, in order to perform probabilistic inference. For a possible label $x$ for landmark $l_j$, we can compute the surface descriptor at $x$ for the 20 radii used for training and evaluate the node potential, which allows us to compute the probability of $x$ being the location of landmark $l_j$ on $F_k$. Given pairs of possible labels of landmarks that are connected by an edge in the Markov network, we can compute the edge potential by computing the distance between the two labels in the canonical form of $F_k$, which allows us to compute the joint probability of the two labels being the locations of the corresponding landmarks. Since the graph representing the connections between the landmark locations is a tree, a simple message passing scheme can then be used to find the labels that maximize the joint probability of being the landmark locations [24, Chapter 4].
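Exact maximization on a tree-structured network can be done with one upward pass of max-messages followed by a downward pass that reads off the best labels. The sketch below works with log-potentials (so products become sums); the data layout, function name, and toy scores are hypothetical and only illustrate the general scheme, not the paper's implementation.

```python
def map_labels_on_tree(children, root, node_score, edge_score):
    """Maximum a posteriori labelling on a tree via max-sum message passing.
    children:   dict node -> list of child nodes
    node_score: dict node -> list of log node potentials, one per label
    edge_score: dict (parent, child) -> table[parent_label][child_label]"""
    best_child_label = {}  # (child, parent_label) -> best label of child

    def upward(node):
        # score[l]: best total log-potential of the subtree with node = l
        score = list(node_score[node])
        for child in children.get(node, []):
            child_score = upward(child)
            table = edge_score[(node, child)]
            for lp in range(len(score)):
                options = [table[lp][lc] + child_score[lc]
                           for lc in range(len(child_score))]
                lc_best = max(range(len(options)), key=options.__getitem__)
                best_child_label[(child, lp)] = lc_best
                score[lp] += options[lc_best]
        return score

    root_score = upward(root)
    labels = {root: max(range(len(root_score)), key=root_score.__getitem__)}
    stack = [root]
    while stack:  # downward pass: decode the children given their parent
        node = stack.pop()
        for child in children.get(node, []):
            labels[child] = best_child_label[(child, labels[node])]
            stack.append(child)
    return labels

# Toy network: landmark "a" connected to "b" and "c", two candidates each.
children = {"a": ["b", "c"]}
node_score = {"a": [0.5, 0.0], "b": [0.0, -1.0], "c": [-1.0, 0.0]}
edge_score = {("a", "b"): [[0.0, -5.0], [-5.0, 0.0]],
              ("a", "c"): [[0.0, -5.0], [-5.0, 0.0]]}
labels = map_labels_on_tree(children, "a", node_score, edge_score)
```

The cost is linear in the number of edges and quadratic in the number of candidate labels per landmark, which is why keeping the label sets small matters.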

It remains to discuss how the sets of possible labels for landmark $l_j$ are found. Recall that we assume that the landmark locations on the first frame $F_1$ are provided by the user (this is the only user input assumed by our fitting algorithm). For the remaining frames, we take advantage of the temporal consistency of the input sequence to find sets of possible labels for $F_k$ based on the predicted landmark locations on frame $F_{k-1}$. That is, vertices on $F_k$ in the neighborhood of the predicted landmark $l_j$ on $F_{k-1}$ are considered as candidate labels for $l_j$. In our implementation, we choose as label set the 200 points on $F_k$ that are closest to the location of $l_j$ on $F_{k-1}$.
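Selecting the candidate label set amounts to a k-nearest-neighbor query around the previous landmark position. A minimal brute-force sketch follows; the function name and the toy data are hypothetical, and a spatial search structure would replace the linear scan for full-resolution scans.

```python
import heapq
import math

def candidate_labels(vertices, previous_landmark, k=200):
    """Indices of the k vertices of the current frame closest to the
    landmark position predicted on the previous frame (closest first)."""
    return heapq.nsmallest(
        k, range(len(vertices)),
        key=lambda i: math.dist(vertices[i], previous_landmark))

# Toy frame: ten vertices along the x axis, landmark previously at x = 3.2.
vertices = [(float(i), 0.0, 0.0) for i in range(10)]
labels = candidate_labels(vertices, (3.2, 0.0, 0.0), k=3)
```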

This selection of the label set is the main difference from the landmark prediction method by Wuhrer et al. [31], which predicts landmarks on a static scan using label sets found using the canonical form of the scan. Our selection has two advantages. First, our approach is computationally more efficient than the previous method as, thanks to the temporal consistency between adjacent frames, a single label set suffices to predict landmarks accurately. In contrast, the method by Wuhrer et al. considers eight label sets found using eight possible alignments in canonical form space, computes a candidate solution for each label set using probabilistic inference, and finally selects the most suitable solution automatically using an energy term. Second, our approach is designed to lead to stable solutions as corresponding landmarks in adjacent frames are close to each other, which prevents prediction errors due to symmetric regions (i.e. mixing up the left and right sides of the body).

Hence, by design, the tracking of the landmarks is robust with respect to changes that have the property that each landmark on $F_k$ is in the neighborhood of its corresponding landmark on $F_{k-1}$. We validate experimentally that this assumption holds for human motion sequences even in the presence of fast localized movements. Note that since we perform probabilistic inference on the learned Markov network to find the best landmark location, the landmark on $F_k$ does not need to be the closest neighbor to its corresponding landmark on $F_{k-1}$.

5.2 Posture Fitting

Given a set of (predicted) landmarks on $F_k$, we aim to fit the posture of a rigged template model $T$ to the posture of $F_k$. We compute our template $T$ as the mean shape over all models of the training database that were captured in a standard posture. The model is rigged using the publicly available software Pinocchio [6], and the landmark locations are manually placed on $T$.

We model the deformation of the skeleton of $T$ using a scene graph structure consisting of 17 bones, where bones are ordered in depth-first order, and the transformation of each bone is expressed using a local transformation relative to its parent. The bone structure of the rigged template is shown in the top row of Figure 2. The root bone is transformed using a rigid transformation consisting of a rotation (parameterized using a rotation axis and angle), a scale factor, and a translation vector. The relative transformations of the remaining bones are expressed using a rotation with respect to their parent bones. We denote the transformation parameters of the bones by $\Theta$. Note that it is straightforward to compute the global bone transformations using composite transformations.
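Composing parent-relative bone rotations into global transformations can be sketched with homogeneous 4×4 matrices. The following is an illustration under stated assumptions, not the paper's code: each bone rotates about a rest-pose joint position, bones are listed so that a parent precedes its children (as in a depth-first ordering), and the root's scale and translation are omitted. All names and the two-bone example are hypothetical.

```python
import math

def translation(v):
    return [[1.0, 0.0, 0.0, v[0]], [0.0, 1.0, 0.0, v[1]],
            [0.0, 0.0, 1.0, v[2]], [0.0, 0.0, 0.0, 1.0]]

def rotation(axis, angle):
    """Homogeneous axis-angle rotation (Rodrigues formula)."""
    norm = math.sqrt(sum(comp * comp for comp in axis))
    x, y, z = (comp / norm for comp in axis)
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [[t * x * x + c, t * x * y - s * z, t * x * z + s * y, 0.0],
            [t * x * y + s * z, t * y * y + c, t * y * z - s * x, 0.0],
            [t * x * z - s * y, t * y * z + s * x, t * z * z + c, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def apply(m, p):
    h = [p[0], p[1], p[2], 1.0]
    return [sum(m[i][k] * h[k] for k in range(4)) for i in range(3)]

def global_bone_transforms(parents, joints, axes, angles):
    """parents[i] is the parent index of bone i (None for the root)."""
    transforms = []
    for i, parent in enumerate(parents):
        back = translation([-comp for comp in joints[i]])
        # Rotate about the bone's own joint: move the joint to the origin,
        # rotate, and move back.
        local = matmul(translation(joints[i]),
                       matmul(rotation(axes[i], angles[i]), back))
        transforms.append(local if parent is None
                          else matmul(transforms[parent], local))
    return transforms

# Two-bone chain: both bones bend by 90 degrees about the z axis.
parents = [None, 0]
joints = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
axes = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
angles = [math.pi / 2, math.pi / 2]
tip = apply(global_bone_transforms(parents, joints, axes, angles)[1],
            [2.0, 0.0, 0.0])
```

In the example, the tip at (2, 0, 0) first swings around the child joint to (1, 1, 0) and is then carried by the root rotation to (-1, 1, 0).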

Our posture fitting method extends the variational approach proposed by Wuhrer et al. [31], which estimates the posture of a static scan, to estimate a sequence of postures for a given set of frames. To find posture estimates that are stable over time efficiently, we take advantage of the temporal consistency between adjacent frames. That is, we initialize the transformation parameters of frame $F_k$ to the final result computed for frame $F_{k-1}$ for $k > 1$. This initialization not only ensures that the resulting posture estimates change smoothly over time, but also leads to an efficient optimization as the initial posture parameters are generally close to the optimal solution. We validate experimentally that this initialization allows us to accurately estimate the postures even in the presence of fast localized movements.

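With this initialization, we proceed as in the static case by optimizing the posture using two consecutive energy minimizations. First, we use the anthropometric landmark locations to optimize the posture by minimizing

$$E_{\mathrm{lnd}} = \sum_{i=1}^{14} \Big\| \sum_{j=1}^{17} w_{ij} \, B_j \, l_i(T) - l_i(F_k) \Big\|^2 \tag{3}$$

with respect to the bone transformation parameters $\Theta$, where $B_j$ is the global transformation of the $j$-th bone defined by $\Theta$, $w_{ij}$ is the rigging weight for the $j$-th bone and the $i$-th landmark of $T$, $l_i(T)$ denotes landmark $l_i$ on $T$, and $l_i(F_k)$ denotes landmark $l_i$ on the current frame $F_k$.

Second, we use all vertex positions on frame $F_k$ to optimize the posture by minimizing

$$E_{\mathrm{nn}} = \sum_{i=1}^{n_v} \Big\| \sum_{j=1}^{17} w_{ij} \, B_j \, p_i - \mathrm{NN}(p_i) \Big\|^2 \tag{4}$$

with respect to the parameters $\Theta$, where $w_{ij}$ is the rigging weight for the $j$-th bone and the $i$-th vertex $p_i$ of $T$ and where $\mathrm{NN}(p_i)$ is the nearest neighbor of the transformed vertex $\sum_j w_{ij} B_j p_i$ in frame $F_k$.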

5.3 Shape Fitting

This section describes how to change the shape details of the posture-aligned template model $T$ to fit to the shape of frame $F_k$. To simplify notation, in this section, let $T$ denote the template model after it was deformed to match the posture of $F_k$.

The remaining problem is to fit $T$ to a frame $F_k$, where $T$ has a similar posture as $F_k$. We solve this problem using an energy optimization method similar to the one by Allen et al. [1], who deform each vertex $p_i$ of $T$ using an affine transformation matrix $A_i$. That is, the deformed vertex is expressed as $A_i p_i$, and the goal is to find $A_i$ that moves every vertex of $T$ close to the scan while maintaining a smooth deformation field. The smoothness is modeled using the energy $E_{\mathrm{smooth}} = \sum_{(p_i, p_j) \in \mathcal{E}} \| A_i - A_j \|_F^2$, where $\mathcal{E}$ is the edge set of $T$ and where $\| \cdot \|_F$ denotes the Frobenius norm.

One drawback of this approach is that the Frobenius norm between transformation matrices is used to measure the difference between transformations. This is problematic because a global scaling of the object results in a different relative weighting of the rotation and translation components encoded in $A_i$.

We remedy this problem by deforming each vertex using a translation and a rotation. The translation is encoded using a translation vector $t_i$, and the rotation is encoded using a rotation axis $a_i$ and a rotation angle $\theta_i$. Let $\mathrm{Tr}(t)$ be the ($4 \times 4$) matrix that translates a point by translation vector $t$, and let $\mathrm{Rot}(a, \theta)$ be the ($4 \times 4$) matrix that rotates a point by angle $\theta$ around $a$. We compute the deformation matrix as $A_i = \mathrm{Tr}(p_i) \, \mathrm{Tr}(t_i) \, \mathrm{Rot}(a_i, \theta_i) \, \mathrm{Tr}(-p_i)$. That is, the deformation parameters are expressed with respect to a local coordinate frame centered at $p_i$.
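A per-vertex deformation expressed in a frame centered at the vertex can be sketched as a conjugation by translations. The order used below (rotate about the vertex, then translate) is an assumption made for illustration, as are the function names and the sample values. With this order the vertex itself is simply moved by the translation vector, and the rotation only affects how nearby points would move.

```python
import math

def translation(v):
    return [[1.0, 0.0, 0.0, v[0]], [0.0, 1.0, 0.0, v[1]],
            [0.0, 0.0, 1.0, v[2]], [0.0, 0.0, 0.0, 1.0]]

def rotation(axis, angle):
    """Homogeneous axis-angle rotation (Rodrigues formula)."""
    norm = math.sqrt(sum(comp * comp for comp in axis))
    x, y, z = (comp / norm for comp in axis)
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [[t * x * x + c, t * x * y - s * z, t * x * z + s * y, 0.0],
            [t * x * y + s * z, t * y * y + c, t * y * z - s * x, 0.0],
            [t * x * z - s * y, t * y * z + s * x, t * z * z + c, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def apply(m, p):
    h = [p[0], p[1], p[2], 1.0]
    return [sum(m[i][k] * h[k] for k in range(4)) for i in range(3)]

def vertex_deformation(p, t, axis, angle):
    """Move p to the origin, rotate, translate by t, and move back
    (assumed order of the rotation and translation factors)."""
    to_origin = translation([-comp for comp in p])
    local = matmul(translation(t), rotation(axis, angle))
    return matmul(translation(p), matmul(local, to_origin))

p = [2.0, 3.0, 4.0]
A = vertex_deformation(p, t=[0.0, 0.0, 1.0], axis=[0.0, 0.0, 1.0],
                       angle=math.pi / 2)
moved_vertex = apply(A, p)                    # p + t
moved_neighbor = apply(A, [3.0, 3.0, 4.0])    # rotated about p, then shifted
identity = vertex_deformation(p, [0.0, 0.0, 0.0], [0.0, 0.0, 1.0], 0.0)
```

Because translation and rotation are separate parameters here, a smoothness term can weight them independently, which is the motivation given for avoiding the Frobenius norm on full matrices.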

The goal is to fit to using a smooth deformation field by minimizing


with respect to the deformation parameters and , where is the nearest neighbor of the transformed vertex in , is twice the average edge length in , and contains the set of all points of located within a sphere of radius centered at . Here, is a function that measures the distance of a vertex of the template to the interior of the frame as


where is the outer normal vector of point on .

The first energy term drives the template mesh to the observed data. The second energy term encourages the template to stay within the volume of the observed scan (we thank the anonymous reviewer for suggesting this energy term). A similar energy term has recently been introduced by Perbet et al. [25]. We only consider the first two terms corresponding to a vertex if the angle between the outer normal vectors of the transformed vertex on the template and its nearest neighbor in the scan does not exceed a fixed threshold. The third energy term encourages a globally smooth deformation of the surface by encouraging close-by points (measured with respect to the local mesh resolution around the points) to have similar deformation parameters. For this energy term, points that are closer in the template mesh obtain a higher weight than points that are farther away.

We initialize to the zero vector, to the normalized vector pointing in direction , and to zero. Following previous work on template fitting [1, 20], our approach starts by setting and to a relatively low value compared to to smoothly deform towards , and subsequently increases the relative influence of and to allow to fit more closely to in localized areas. Specifically, in our implementation, we initially set , , and , and we relax as whenever the energy does not change much. We stop if or .

5.4 Restriction to Learned Shape Model

After fitting $T$ to each frame $F_k$, we have a set of $m$ parameterized models. All of these models describe the same subject, and hence, they should all have the same body shape. However, if the subject we track was dressed during the acquisition, the shapes of some or all of the frames may include geometric detail that is not part of the human body shape. We now adjust the shapes such that they lie within the learned shape space of human body shapes.

For simplicity, in the following let denote the parameterized frames found by minimizing Equation 5. Using the learned posture-invariant shape space from Section 4.2, we can express each as a point in . Recall that was learned based on a set of training shapes . If the tracking result found the accurate body shape for each frame, all should correspond to the same point in . However, in practice, due to the presence of noise and clothing, the points are different. We choose the mean of the projections of into to represent the initial body shape estimate. Let denote this representative. If the user is willing to provide confidence weights for each frame that describe how close the captured scan is to the true body shape, the representative can be computed as a weighted average, where each is weighted by the corresponding confidence weight. Note that in general, is different from the mean of the learned PCA space . If is located far from the mean shape of the training population (which is the origin of ), it is likely that clothing resulted in tracking results that do not accurately represent the body shape of the subject. In this case, we move to the intersection of the line through and the origin of with the ellipsoid , where is the covariance matrix of the population . That is, we move the representative linearly towards the origin of the shape space until it is at most three standard deviations from the origin.
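The averaging and clamping step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes that each frame has already been projected to PCA coordinates (so that the covariance is diagonal and given by per-axis variances), and the function names `weighted_mean` and `clamp_to_ellipsoid` are hypothetical.

```python
import math

def weighted_mean(points, weights=None):
    """Average the per-frame shape coordinates, optionally with confidence weights."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    dim = len(points[0])
    return [sum(w * p[k] for p, w in zip(points, weights)) / total
            for k in range(dim)]

def clamp_to_ellipsoid(point, variances, max_std=3.0):
    """Move a point linearly towards the origin until it lies within
    max_std standard deviations (assumes a diagonal covariance in PCA coordinates)."""
    # Mahalanobis distance from the origin for a diagonal covariance.
    d = math.sqrt(sum(x * x / v for x, v in zip(point, variances)))
    if d <= max_std:
        return list(point)
    scale = max_std / d  # uniform scaling keeps the point on the line through the origin
    return [scale * x for x in point]

# Two frames projected to a 2D shape space (illustrative values only).
frames = [[6.0, 0.0], [2.0, 0.0]]
representative = weighted_mean(frames)                     # [4.0, 0.0]
clamped = clamp_to_ellipsoid(representative, [1.0, 4.0])   # [3.0, 0.0]
```

Scaling the point uniformly is equivalent to intersecting the line through the point and the origin with the three-standard-deviation ellipsoid, because the Mahalanobis distance scales linearly along that line.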


The representative describes the body shape of the captured subject in . Using the learned principal components, we can compute the local coordinates and the scale corresponding to . We now deform each frame to achieve these local coordinates and scale. Recall from Section 4.2 that for any mesh


Here, , and are given and we aim to find vertex positions that satisfy the above equation.

Equation 7 implies that , where is the one-ring neighborhood of . Hence, we can find a solution by deforming the vertices of each frame to minimize


Wuhrer et al. [32] optimize for a single frame using a two-stage process consisting of an iterative method followed by a quasi-Newton optimization that ensures that a good local minimum is found. In our case, however, the use of temporal consistency between adjacent frames during tracking results in frames that provide a good initialization for the quasi-Newton optimization of Equation 8. Hence, we can directly minimize using a quasi-Newton method, which leads to a gain in efficiency.

6 Evaluation

We implemented the proposed approach using C++. To compute (exact) nearest neighbors, the implementation in ANN [3] is used, and to minimize the energies , , , and , a quasi-Newton approach [21] is used.

6.1 Estimating Shape and Posture Using Static Scans

We first evaluate our approach when fitting the proposed statistical model to static input scans of subjects captured with and without loose clothing. In this scenario, we compare the accuracy achieved by our method to that of the variant of the commonly used SCAPE model proposed by Jain et al. [18]. To simplify the presentation, we slightly abuse the notation and refer to this variant as the SCAPE model in the following. We used the MPI database [14] to learn the SCAPE model, using all models in standard posture for the shape model and a single model in 35 postures for the posture model. For the shape model, of the variability present in the training data are retained, as we observe empirically that over-fitting does not occur when learning from this database of models in standard posture. For the SCAPE fitting, a constrained optimization is used to find the shape and posture parameters located within three standard deviations of the model mean. The SCAPE fitting iteratively fits to nearest neighbors.

To train our model, we use the scans of all subjects in all available postures of the MPI database. That is, for the same training database, our method is able to leverage more scans for training. For all experiments shown in this section, the 14 landmarks are picked manually and provided as input to both fitting algorithms.

Subjects in minimal tight clothing

In the first experiment, we fit the statistical model to input scans of subjects in minimal tight clothing. To evaluate the fitting accuracy in this case, we divided the subjects in the MPI database into two halves. We used one half to train both the SCAPE model and our model (the 260 scans of subjects ), and the other half was used for testing (the 260 scans of subjects ). For the SCAPE model, again all available shapes in standard posture were used for the shape model and a single subject in 35 postures was used for the posture model, while we used all available scans of half of the database to train our model. The two learned statistical models were then fitted to the remaining models of the database. Figure 3 shows the cumulative plots of the distances of the vertices of the fitting results to their corresponding vertices in the registered MPI database. Our method outperforms SCAPE. Some of the high errors for both methods stem from noise in the database. For SCAPE, many of the high errors are in the area of the torso, which is not always fitted well to the data because no landmarks guide the model in this area; consequently, the posture model learned by SCAPE fails to fit accurately to the data. In contrast, our skeleton-based posture fitting usually fits the model well to the data in spite of the lack of landmarks in the torso area.

Figure 3: Results of fitting SCAPE and our model to a subset of the MPI database.

Subjects in casual clothing

To evaluate our algorithm on a database of more challenging static scans, we collected a data set consisting of a total of 18 body scans of 4 subjects dressed in regular casual office clothing in up to 5 postures each using Kinect Fusion [17]. We simultaneously captured the front and back views for every subject in each posture separately using Kinect Fusion and manually merged the two resulting views. Some of the scans (covering all 4 subjects and 5 postures) are shown in the first row of Figure 4. The postures were chosen to resemble the postures used by Balan and Black [5]. We could not use their data directly, as their method takes a small set of input images (not covering the full view of the body), while we require a scan that covers the full body. Note that the scans are corrupted by noise and missing data. For each of the four subjects, we further recorded the height, waist circumference, and chest circumference. We use these measures to evaluate the accuracy of the fitting results by computing the corresponding measurements on the resulting fitted models.

Figure 4: Top: data set of static scans of people dressed in regular clothing. Middle: results of fitting SCAPE model to a single scan. Bottom: results of fitting our model to a single scan.

For both SCAPE and our method, we perform two ways of data fitting. First, we fit the models to each input scan (in a single posture) individually, and second, we fit the models to all postures available for a given subject jointly by solving for a single body shape estimate and multiple posture estimates.

To evaluate the results, we first measure the fitting accuracy by computing the distance between each vertex of the result and its closest point on the input data. Figure 5 summarizes the fitting accuracy. Note that for both options, our method leads to models that are closer to the input data than SCAPE. For our method, the distance to the input data increases when multiple postures are fitted simultaneously. This is to be expected as multiple observations of a dressed subject give more cues about the body shape, which leads to a better body shape estimate that may deviate more from the data, which includes details of clothing. To see that our body shape estimate improves when multiple postures are used, refer to Figure 6 (discussed in detail below), where the improvements can be seen from the reduced standard deviations, which is especially visible for the height measurement. For the SCAPE model, the opposite behaviour can be observed. The reason is that the SCAPE model is not fitted well to a single input scan, as can be seen in Figure 4, which shows some fitting results. Note that the results using our method represent realistic body shapes and postures that are close to the input scans, while this is not always the case for the results using SCAPE. For instance, the following body parts are estimated inaccurately in the results found by SCAPE: the posture of the legs shown in the second column, the posture of the feet shown in the third column, the posture of the upper back shown in the fourth column, and the posture of the head shown in the fifth column.
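The accuracy measure used here can be illustrated with a short sketch. It makes two simplifying assumptions that are not from the paper: the scan is treated as a set of sample points (so the distance to the closest sample stands in for the distance to the closest point on the surface), and a brute-force search replaces the ANN library used in the actual implementation. The function names are hypothetical.

```python
import math

def closest_distances(result_vertices, scan_points):
    """Distance from each fitted vertex to its nearest scan sample (brute force)."""
    return [min(math.dist(v, p) for p in scan_points) for v in result_vertices]

def cumulative_fraction(distances, thresholds):
    """Fraction of vertices whose distance is at most each threshold;
    plotting these fractions against the thresholds gives a cumulative error curve."""
    n = len(distances)
    return [sum(1 for d in distances if d <= t) / n for t in thresholds]

# Illustrative toy data, not taken from the experiments.
result = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 3.0)]
scan = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
dists = closest_distances(result, scan)
curve = cumulative_fraction(dists, [0.5, 1.0, 3.0])
```

A curve that rises faster indicates a result that lies closer to the input data, which is how the comparison between the two models in Figure 5 is read.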

Figure 5: Cumulative distances when fitting SCAPE and our model to a data set of 18 scans acquired using Kinect Fusion.

Second, we measure the height and circumferences on the fitting results. The circumference measurements are computed by intersecting the torso of the model with a plane parallel to the floor plane and by computing the length of the convex hull of this intersection. The results for the different methods are summarized in Figure 6. While our method predicts the height of the models quite accurately (even though some of the subjects wore shoes during acquisition), the waist and chest circumferences are overestimated because the clothing tricks the method into predicting body shapes with larger circumferences. This is especially true for the waist circumference, where the body shape of the acquired subjects is hidden by large clothing folds, as can be seen in Figure 4. SCAPE leads to a significantly worse estimate of the height, but to better estimates of the circumferences. Note however that while the two estimated circumferences have low error for SCAPE, the estimated body shape is often inaccurate, as can be seen in the chest area of the model shown in the first column of Figure 4. Here, the overall body shape estimate of our method is closer to the input data than the one by SCAPE. Furthermore, the estimates of the circumferences found using SCAPE get worse when the model is fitted to multiple scans simultaneously. The reason is that using multiple scans leads to fitting results that are closer to the data (as can be seen in Figure 5), which leads to overestimated circumferences due to the clothing.
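The circumference measurement described above (length of the convex hull of a planar cross-section) can be sketched as follows. The sketch assumes that the intersection of the torso with the horizontal plane has already been computed and is given as 2D points; the hull is built with Andrew's monotone chain algorithm, which is a choice made for this illustration and not necessarily what the original implementation uses.

```python
import math

def convex_hull(points):
    """Convex hull of 2D points (Andrew's monotone chain), in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a left turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # Drop the last point of each chain because it repeats the start of the other.
    return lower[:-1] + upper[:-1]

def hull_circumference(points):
    """Perimeter of the convex hull of a planar cross-section."""
    hull = convex_hull(points)
    n = len(hull)
    if n < 2:
        return 0.0
    return sum(math.dist(hull[i], hull[(i + 1) % n]) for i in range(n))

# Square cross-section with two extra points that do not lie on hull corners.
section = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (1.0, 1.0), (1.0, 0.0)]
print(hull_circumference(section))  # 8.0
```

Using the convex hull rather than the raw cross-section contour mimics a tape measure, which spans concavities instead of following them.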

Figure 6: Measurement errors of estimate by SCAPE and our model of a data set of 18 scans acquired using Kinect Fusion. Plot shows means and standard deviations of the errors.

When fitting multiple postures simultaneously using our method, the standard deviations of all measurements decrease, which indicates that the errors get spread more evenly, which is to be expected as the shapes are averaged in the shape space . We observed that for some scans, the measurement errors decrease significantly, while for other scans, there is a slight increase in some of the measurement errors. Having additional observations mainly improves the accuracy of the shape estimate for frames that had a high error with single frame fitting compared to the other available frames. One instance where the errors decrease significantly is the scan shown in the leftmost column of Figure 4. Here, the errors on the height, waist, and chest measurements decrease by 4.4 cm, 2.1 cm, and 2.1 cm, respectively, when all postures are used instead of a single one. Figure 7 shows how the measurement error decreases when increasing the number of scans used to estimate the body shape from one to five. Note that all errors decrease significantly when using two scans instead of one to compute the shape estimate, while additional frames only lead to minor improvements.

Figure 7: Errors of estimated height, waist circumference, and chest circumference measurements of the scan shown in the leftmost column of Figure 4 with increasing number of scans used to estimate the body shape.

Figure 8: Synthetic noise evaluation. Each row shows the input data and the results of our method (rows: no noise, Gaussian noise, and outliers; columns: frames 1, 6, 11, and 16).

To conclude, we showed that for the fitting results to the 18 dressed subjects, our method leads to results that represent the overall body shape and posture correctly, while this is not always the case for SCAPE. Furthermore, the results found by our method are closer to the input data than the results found by SCAPE. While two circumference measurements are estimated more accurately using SCAPE than using our method, the overall body shapes predicted using SCAPE are often visually far from the true body shape. Hence, overall, the fitting accuracy of our method is higher than that of SCAPE.

6.2 Tracking Motion Sequences

Next, we evaluate our method for tracking motion sequences showing humans with and without loose clothing.

Synthetic motion sequences

We start by fitting our model to a synthetic motion sequence of a minimally dressed subject obtained by animating a processed scan of the CAESAR database [26] using Pinocchio [6]. This test allows us to evaluate our method in the presence of controlled input noise. The following three types of noise are considered: (1) Gaussian noise with variance of of the bounding ball radius of the model applied to the input vertices, (2) outliers modeled by perturbing a vertex with probability along its normal direction by a magnitude that is uniformly distributed in the range , where is the average edge length of the model, and (3) holes that were added to the input models. For each sequence, we use our algorithm to track the data, and we evaluate the quality of the result by measuring the distance between the vertices of the result and their nearest neighbors in the original (uncorrupted) sequence. The model starts from a standing position, goes into a squatting position, and returns to the starting standing position. Figure 8 shows the input models and the results of the first half of the sequence, and Figure 9 shows the means and standard deviations of the distance of our result to the uncorrupted input model for each frame. The following two observations can be made. First, the tracking is stable, which means that there is no significant drift in the later frames. This can be seen as the motion is symmetric w.r.t. the squatting position (frame 16), and as frames corresponding to the same posture in the first and the second half of the motion sequence (i.e. frames and for ) have similar error. This is due to the landmark prediction step that gives a good initialization to the posture fitting. Second, the synthetic noise does not have a significant influence on the results, which shows that our method is robust to different types of noise.
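The first two noise types can be sketched as follows. The numeric settings of the experiment are not reproduced here; the noise level, the outlier probability, and the magnitude range are caller-supplied parameters, and all function and parameter names are hypothetical. The sketch also assumes that `sigma` is a standard deviation in the same units as the vertex coordinates and that the normals have unit length.

```python
import random

def add_gaussian_noise(vertices, sigma, rng):
    """Perturb every coordinate of every vertex with zero-mean Gaussian noise."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in v) for v in vertices]

def add_outliers(vertices, normals, probability, low, high, rng):
    """With the given probability, displace a vertex along its normal by a
    magnitude drawn uniformly from [low, high]; other vertices are unchanged."""
    noisy = []
    for v, n in zip(vertices, normals):
        if rng.random() < probability:
            magnitude = rng.uniform(low, high)
            noisy.append(tuple(c + magnitude * nc for c, nc in zip(v, n)))
        else:
            noisy.append(tuple(v))
    return noisy

# Illustrative usage with arbitrary settings (not the values used in the paper).
rng = random.Random(0)  # fixed seed so that the corrupted sequence is reproducible
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
jittered = add_gaussian_noise(vertices, 0.01, rng)
with_outliers = add_outliers(vertices, normals, 0.1, 0.0, 0.5, rng)
```

Passing an explicit random generator makes it easy to corrupt every frame of a sequence reproducibly, so that tracking errors can be compared across runs.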

Figure 9: Tracking of synthetic sequence corrupted by different types of noise. Plot shows means and standard deviations of the errors.

Acquired motion sequences

We also evaluate our method when fitting the learned statistical model to motion sequences of dressed subjects acquired using different systems. Since there is no ground truth available for this input data, we evaluate the results visually in this case. We fit our model to three input sequences of a male subject acquired while marching [29] (we use a sequence of 57 frames), a male subject acquired performing a kicking motion [9] (we use a sequence of 39 frames), and a female subject acquired while dancing [9] (we use a sequence of 49 frames). Figure 10 shows the input data and the results of our method for several frames, and results for the full motion sequences can be seen in the supplementary material. Note that in spite of the loose clothing, realistic body shapes are obtained. Furthermore, due to the stable initialization with automatically placed landmarks, the tracking does not fail, even in the case of the fast kicking motion.

Figure 10: Results of tracking motion sequences acquired using different systems. For each example: top shows the input data and bottom shows our result for seven input frames that are evenly distributed in time.

6.3 Limitations

Finally, we outline some limitations of our method. First, by using a skeleton-based deformation to model posture changes, our method may generate unrealistic bending at joints, especially when the data to fit to is missing or unreliable in this area. This can be observed on the right elbow shown in the rightmost frame in the last row of Figure 10. The reason for such artefacts is that muscle bulging and stretching are not modeled in our shape space.

Second, while we have demonstrated that our method can estimate the human body shape and postures for a given input sequence of scans representing a person dressed in regular clothing, our method fails in cases of very loose clothing, such as skirts or dresses. An example where unrealistic body shapes are estimated is shown in Figure 11, which shows a frame of an input sequence of a dancing woman (dataset from de Aguiar et al. [9]). For this sequence, our method computes a valid output in each step. However, the estimated shape and posture of the upper legs are unrealistic. For more extreme cases like a person wearing a wide dress, where a significant portion of the body is obstructed by loose clothing, we expect the landmark prediction method to fail as the intrinsic geometry of the scan no longer resembles the learned shape space. However, we have not observed this problem in our experiments.

Figure 11: Input data and estimated body shape for a frame of a sequence showing a dancing woman wearing a skirt.

Furthermore, there is currently no guarantee that the estimated body shape is inside the observed clothing, even though this must be the case in reality. However, the clothing term used in Equation 5 discourages the estimated shape from protruding from the scan, and in our experiments, the estimated shape is almost always entirely contained within the scan. The general limitation of not guaranteeing that the estimated shape is inside the clothing is shared by other methods that use a SCAPE model to find a shape and posture estimate from an input scan or a set of input images.

7 Conclusion

We proposed an approach to estimate the body shape and postures of dressed human subjects in motion. Our method, which uses a posture-invariant shape space to model body shape variation combined with a skeleton-based deformation to model posture variation, was shown to have higher fitting accuracy than a popular variant of the commonly used SCAPE model [2, 18] when fitting to static scans of both dressed and undressed subjects. Furthermore, we showed that our method performs well on motion sequences of dressed subjects.


Acknowledgments

We thank Nils Hasler for making the MPI database available, Daniel Vlasic and Christian Theobalt for making their tracking results available, and the volunteers who participated in our scanning experiment. We further thank Gautham Adithya and Monica Vidriales for help in conducting the comparison to the variant of the SCAPE model, and the anonymous reviewers for insightful comments. This work has partially been funded by the Cluster of Excellence Multimodal Computing and Interaction within the Excellence Initiative of the German Federal Government.


References

  • [1] Brett Allen, Brian Curless, and Zoran Popović. The space of human body shapes: reconstruction and parameterization from range scans. ACM Transactions on Graphics, 22(3):587–594, 2003. Proceedings of SIGGRAPH.
  • [2] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. SCAPE: shape completion and animation of people. ACM Transactions on Graphics, 24(3):408–416, 2005. Proceedings of SIGGRAPH.
  • [3] Sunil Arya and David M. Mount. Approximate nearest neighbor queries in fixed dimensions. In Symposium on Discrete Algorithms, pages 271–280, 1993.
  • [4] Andreas Baak, Meinard Müller, Gaurav Bharaj, Hans-Peter Seidel, and Christian Theobalt. A data-driven approach for real-time full body pose reconstruction from a depth camera. In IEEE International Conference on Computer Vision, pages 1092–1099, 2011.
  • [5] Alexandru O. Balan and Michael J. Black. The naked truth: Estimating body shape under clothing. In European Conference on Computer Vision, pages 15–29, 2008.
  • [6] Ilya Baran and Jovan Popović. Automatic rigging and animation of 3D characters. ACM Transactions on Graphics, 26(3), 2007. Proceedings of SIGGRAPH.
  • [7] Yinpeng Chen, Zicheng Liu, and Zhengyou Zhang. Tensor-based human body modeling. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 105–112, 2013.
  • [8] Stefano Corazza, Lars Muendermann, Ajit Chaudhari, T. Demattio, Claudio Cobelli, and Thomas Andriacchi. A markerless motion capture system to study musculoskeletal biomechanics: visual hull and simulated annealing approach. Annals of Biomedical Engineering, 34(6):1019–1029, 2006.
  • [9] Edilson de Aguiar, Carsten Stoll, Christian Theobalt, Naveed Ahmed, Hans-Peter Seidel, and Sebastian Thrun. Performance capture from sparse multi-view video. ACM Transactions on Graphics, 27(3):#98, 1–10, 2008. Proceedings of SIGGRAPH.
  • [10] Edilson de Aguiar, Christian Theobalt, Carsten Stoll, and Hans-Peter Seidel. Marker-less deformable mesh tracking for human shape and motion capture. In IEEE International Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.
  • [11] Asi Elad and Ron Kimmel. On bending invariant signatures for surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1285–1295, 2003.
  • [12] Juergen Gall, Carsten Stoll, Edilson de Aguiar, Christian Theobalt, Bodo Rosenhahn, and Hans-Peter Seidel. Motion capture using joint skeleton tracking and surface estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [13] Nils Hasler, Carsten Stoll, Bodo Rosenhahn, Thorsten Thormählen, and H.-P. Seidel. Estimating body shape of dressed humans. Computers and Graphics (Special Issue SMI’09), 33(3):211–216, 2009.
  • [14] Nils Hasler, Carsten Stoll, Martin Sunkel, Bodo Rosenhahn, and Hans-Peter Seidel. A statistical model of human pose and body shape. Computer Graphics Forum (Special Issue of Eurographics 2008), 28(2), 2009.
  • [15] Thomas Helten, Andreas Baak, Gaurav Bharaj, Meinard Müller, Hans-Peter Seidel, and Christian Theobalt. Personalization and evaluation of a real-time depth-based full body scanner. In 3D Vision, 2013.
  • [16] Radu Horaud, Matti Niskanen, Guillaume Dewaele, and Edmond Boyer. Human motion tracking by registering an articulated surface to 3-d points and normals. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1):158–164, 2009.
  • [17] Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, and Andrew Fitzgibbon. KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. In Symposium on User Interface Software and Technology, pages 559–568, 2011.
  • [18] Arjun Jain, Thorsten Thormählen, Hans-Peter Seidel, and Christian Theobalt. MovieReshape: tracking and reshaping of humans in videos. ACM Transactions on Graphics, 29:148:1–10, 2010. Proceedings of SIGGRAPH Asia.
  • [19] Ron Kimmel and James Sethian. Computing geodesic paths on manifolds. Proceedings of the National Academy of Sciences, 95:8431–8435, 1998.
  • [20] Hao Li, Robert W. Sumner, and Mark Pauly. Global correspondence optimization for non-rigid registration of depth scans. Computer Graphics Forum, 27(5):1421–1430, 2008.
  • [21] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming B, 45:503–528, 1989.
  • [22] Lars Muendermann, Stefano Corazza, and Thomas P. Andriacchi. Accurately measuring human movement using articulated icp with soft-joint constraints and a repository of articulated models. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1–6, 2007.
  • [23] Alexandros Neophytou and Adrian Hilton. Shape and pose space deformation for subject specific animation. In 3D Vision, 2013.
  • [24] Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
  • [25] Frank Perbet, Sam Johnson, Minh-Tri Pham, and Björn Stenger. Human body shape estimation using a multi-resolution manifold forest. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [26] Kathleen Robinette, Hans Daanen, and Eric Paquet. The CAESAR project: A 3-D surface anthropometry survey. In Conference on 3D Digital Imaging and Modeling, pages 180–186, 1999.
  • [27] Carsten Stoll, Nils Hasler, Juergen Gall, Hans-Peter Seidel, and Christian Theobalt. Fast articulated motion tracking using a sums of gaussians body model. In IEEE International Conference on Computer Vision, pages 951–958, 2011.
  • [28] Gary Tam, Zhi-Quan Cheng, Yu-Kun Lai, Frank Langbein, Yonghuai Liu, David Marshall, Ralph Martin, Xian-Fang Sun, and Paul Rosin. Registration of 3D point clouds and meshes: A survey from rigid to non-rigid. IEEE Transactions on Visualization and Computer Graphics, 19(7):1199–1217, 2013.
  • [29] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. Articulated mesh animation from multi-view silhouettes. ACM Transactions on Graphics, 27(3):97, 2008. Proceedings of SIGGRAPH.
  • [30] Alexander Weiss, David Hirshberg, and Michael Black. Home 3D body scans from noisy image and range data. In International Conference on Computer Vision, pages 1951–1958, 2011.
  • [31] Stefanie Wuhrer, Chang Shu, and Pengcheng Xi. Landmark-free posture invariant human shape correspondence. The Visual Computer, 27(9):843–852, 2011.
  • [32] Stefanie Wuhrer, Chang Shu, and Pengcheng Xi. Posture-invariant statistical shape analysis using Laplace operator. Computers & Graphics (Special Issue SMI’12), 36(5):410–416, 2012.
  • [33] Shizhe Zhou, Hongbo Fu, Ligang Liu, Daniel Cohen-Or, and Xiaoguang Han. Parametric reshaping of human bodies in images. ACM Transactions on Graphics, 29:126:1–10, 2010. Proceedings of SIGGRAPH.