EllipBody: A Light-weight and Part-based Representation for Human Pose and Shape Recovery

Human pose and shape recovery is an important task in computer vision and real-world understanding. Current approaches are hampered by the lack of 3D annotations for whole body shapes. We find that part segmentation is a very efficient 2D annotation for 3D human body recovery: it not only indicates the location of each part but also carries 3D information through the occlusions between part shapes, as indicated in Figure 1. To better exploit the 3D information contained in part segmentation, we propose a part-level differentiable renderer that models occlusion between parts explicitly. It improves both learning-based and optimization-based methods. To further improve efficiency, we propose a light-weight body model called EllipBody, which uses one ellipsoid per body part. Together with SMPL, we analyze the relationship between forward time, performance, and the number of faces in body models, and choose a small number of faces that achieves good performance and efficiency at the same time. Extensive experiments show that our methods achieve state-of-the-art results on the Human3.6M and LSP datasets for 3D pose estimation and part segmentation.




1 Introduction

Figure 1: The white box in (a) marks a predicted pose that confuses the two crossed arms even though the projected joints are correct. After optimization with the groundtruth part segmentation shown in (b), the recovered model achieves the correct projected segmentation (c). (d) is the corresponding detailed model derived from our light-weight parametric model.

Recovering human shape from a single image is a challenging task in computer vision. It aims at predicting both human pose and shape parameters simultaneously and can be applied to a wide variety of applications such as 3D human reconstruction and human-computer interaction. However, the task remains unsolved due to depth ambiguity and the self-occlusions of human poses.

Recent results have shown that enforcing 3D-2D consistency between a 3D human body model and 2D image cues benefits both learning-based and optimization-based approaches [4, 15, 12]. Existing methods mostly fit the projected body model to 2D keypoints and silhouettes in the image space but ignore body part information, which is critical for resolving depth ambiguity through occlusion reasoning. As shown in Figure 1, the predicted model in (a) is consistent with the groundtruth 2D keypoints and silhouettes, but the pose is still incorrect because the two forearms have the wrong depth order. A model that can represent body parts, as in (c), can be optimized against the correct part segmentation in (b) to reach the proper pose and shape.

Current human body representations have advanced 3D pose estimation but still leave a great number of ambiguities. The reason is the lack of a body model with an efficient body part representation in both 2D and 3D. For example, the skeleton is a simple and effective representation for 3D pose estimation, but with joint positions alone it is impractical to reason about occlusion and collision or to tell the somatotype of the estimated body. On the other hand, parametric models such as SMPL [19] and SMPL-X [22] use meshes with thousands of faces and can reconstruct a more detailed human body. These models are expressive, but they encode the shape with a few latent parameters, which makes it hard to refine each body part independently. Moreover, SMPL is computationally expensive due to its redundant number of faces. An intermediate representation that explicitly models body parts would address these limitations and bridge the gap between the skeleton and SMPL. Such a model should be light-weight, with fewer vertices and faces, so that inference is fast.

Due to the lack of 3D data, especially body shape data, most approaches exploit projected keypoints as the 2D supervision. Body part segmentation, which provides critical semantic information, is rarely used, because rendering a 3D model to a 2D image is hard to make differentiable. Differentiable rendering is an approximate way to use segmentation as supervision, refining human pose and shape via full-body silhouettes [26]. However, existing differentiable renderers usually focus on the overall human silhouette. From part segmentation we can infer the orientation, position, length, and thickness of each part; moreover, part segmentation directly reveals the occlusions between body parts.

In this paper, we propose a simple geometry-based body representation called EllipBody. The representation uses one ellipsoid per body part and takes the part lengths, thicknesses, and orientations as explicit parameters. It is light-weight for differentiable rendering and flexible to adjust each part. Optionally, EllipBody can be converted to a detailed body model by using a Multi-Layer Perceptron to re-target the pose and minimizing an ICP loss to obtain the shape parameters of SMPL.


To utilize part segmentation as supervision, we extend the object-level differentiable neural renderer [13] to a part-level differentiable neural renderer. This module takes the silhouette of each body part as supervision and refines each part of EllipBody iteratively. We also design a depth-aware loss in the part renderer that identifies occlusions between parts and keeps the occlusion order consistent with the part segmentation.

To predict the body shape from a single image, we first train an end-to-end network to obtain the parameters of EllipBody as an initial prediction. We then perform a post-optimization that minimizes the part segmentation loss to correct errors in the network prediction. With the EllipBody model and our part-level neural renderer, the performance on the Human3.6M [10] and LSP [11] datasets is competitive with the state of the art.

With all that in mind, our contributions are three-fold.


  • We propose an intermediate human body representation (EllipBody) for human pose and shape recovery. It is light-weight and part-based to accelerate part rendering and optimization processes.

  • We propose an occlusion-aware part-level differentiable renderer (PartDR) to utilize part segmentation as supervision for learning.

  • We implement a framework containing a deep neural network and an iterative post-optimization with part segmentation loss computed by PartDR, which achieves the state-of-the-art performance on human pose and shape recovery.

2 Related Work

2.1 Representations for Human Bodies

Various representations have been proposed for human pose and shape estimation. Among them, the 3D skeleton is simple and effective for representing human pose and has been adopted by many previous methods [31, 24, 29, 30, 28]. In skeleton-based methods, joint positions [20, 28] and volumetric heatmaps [30, 23] are often used to predict the 3D skeleton with neural networks and have significantly improved 3D human pose estimation. However, these methods focus only on pose estimation and ignore human shape. Moreover, the 3D skeleton consists of only a few joints, and kinematic constraints are often neglected.

Figure 2: Framework. Our pipeline has two major parts: (a) a CNN-based network that predicts 1) EllipBody parameters, 2) heatmaps, and 3) segmentation, where both 2) and 3) can be supervised by 2D annotations; (b) a part-level differentiable renderer that produces segmentation, which is then optimized against the part segmentation predicted in (a).

Human shape recovery methods rely on statistical parametric models that represent pose and shape simultaneously. These models are usually learned from body scan data and encode both pose and shape parameters. Loper et al. [19] propose the skinned multi-person linear model (SMPL), which generates a realistic human body with thousands of triangular faces. Recently, Pavlakos et al. [22] present the detailed parametric model SMPL-X, which models the body, face, and hands. Such models contain thousands of vertices and faces and can represent the human body in fine detail. However, the large number of vertices and faces often slows down optimization-based methods [12], and because these models use implicit latent parameters to describe shape, it is hard to refine individual parts independently. We propose a light-weight model as an intermediate representation; it has few vertices and faces and further speeds up optimization-based methods.

2.2 Differentiable Rendering

Rendering connects the image plane with 3D space. Recent works on inverse graphics [8, 18, 13] have put great effort into making this process differentiable so that the renderer can serve as a module in learning-based approaches. Loper et al. [18] propose a differentiable renderer called OpenDR, which provides derivatives with respect to the model parameters. Kato et al. [13] present a neural renderer that approximates the gradient of rasterization as a linear function. These renderers enable recent approaches [21, 26] to exploit segmentation as supervision and improve their performance. Previous differentiable renderers output the shape and textures successfully but ignore the different parts of the object. We extend the differentiable renderer to the part level and propose a depth-aware loss function for part-level rendering. Through this renderer module we can therefore explore the spatial relations between the parts of a single model.

2.3 Human Shape Recovery

Recovering both pose and shape was first addressed with optimization-based solutions. Guan et al. [9] optimize the parameters of the SCAPE [3] model with 2D keypoint annotations. Bogo et al. [4] employ a CNN to obtain 2D keypoints and then propose SMPLify to optimize the parameters of the SMPL model [19]. Lassner et al. [15] take silhouettes and dense 2D keypoints as additional features and use the SMPLify method to obtain more accurate results. The recent expressive human model SMPL-X [22] integrates face, hands, and full body; Pavlakos et al. optimize VPoser [22], a latent space over the SMPL parameters, together with a collision penalty and a gender classifier. Because optimization-based approaches often take a long time to converge, regressing the parameters with a deep neural network has become the majority trend. Pavlakos et al. [25] use a CNN to estimate the parameters from silhouettes and 2D joint heatmaps. Kanazawa et al. [12] present an end-to-end network, HMR, to predict the shape parameters, employing a large dataset to train a discriminator that keeps the parameters plausible. Kolotouros et al. [14] propose GraphCMR, which regresses the position of each vertex through a graph CNN. These solutions usually assume fixed camera intrinsics and extrinsics, which may cause uncontrolled results due to the lack of generalization. Considering the shortcomings and benefits of both optimization-based and CNN-based methods, we employ a convolutional neural network to estimate the pose and shape parameters of EllipBody and then refine the model through an optimization process with a limited number of iterations.

3 Methodology

The goal of our work is to estimate the entire configuration of the human body from a single color image. Our framework is illustrated in Figure 2. Given an image of a human body, a deep convolutional backbone infers the parameters of our model, EllipBody. We then feed the EllipBody into the part-level differentiable renderer (PartDR) to produce individual silhouettes for the body parts, and minimize the difference between the rendered and predicted part segmentation. Finally, we use a linear regressor to acquire a realistic body shape.

Figure 3: EllipBody: Skeleton and Shape. Ellipsoids generated from icosahedrons by repeated surface subdivision are assembled into EllipBody, a balanced intermediate representation between skeletons and shapes.

3.1 EllipBody: An Intermediate Representation

Statistical parametric models have significantly benefited the human shape recovery community; however, they still have specific limitations. These models encode human body shapes into latent parameters and use them to generate a detailed body mesh. Because the latent parameters represent the shape prior implicitly, it is hard to change a body part independently. Moreover, the detailed mesh may slow down optimization due to its redundant faces.

We propose a light-weight and flexible intermediate representation, called EllipBody, to speed up optimization and disentangle body parts. We use one ellipsoid per body part and take the position, orientation, and semi-principal axis lengths of each ellipsoid as explicit parameters. The EllipBody representation contains both the skeleton and the surfaces, and each part can be adjusted independently. We choose the ellipsoid as the part primitive because human part silhouettes are mostly elliptical and an ellipsoid projects continuously across different views.

The proposed representation is an expansion of the human skeleton: it represents body parts, e.g. limbs, torso, and head, with parametric ellipsoids. Each ellipsoid has three independent semi-principal axes; we select one as the skeleton axis and the other two as shape axes. As shown in Figure 3, we align each ellipsoid's skeleton axis with a bone and place its end points at the joints of that bone. Assembled this way, the ellipsoids form a more expressive alternative to the skeleton that can represent pose and shape independently.

The parameters of each ellipsoid contain the bone length and the part thicknesses along the different axes. We use the position and global rotation of each ellipsoid as the pose parameters: the center $c_i$ represents the position of the $i$-th ellipsoid and $R_i$ indicates its global rotation. Writing $l_i$ for the semi-axis length along the bone and $(a_i, b_i)$ for the two shape semi-axes, the proposed EllipBody is formulated as

$M = \{ E_i(c_i, R_i, l_i, a_i, b_i) \}_{i=1}^{N}$,

where $E_i$ denotes the $i$-th ellipsoid and $N$ is the number of body parts.
As the human body is symmetric, ellipsoids in EllipBody share parameters when they represent the same category of part, which reduces the number of semi-principal-axis parameters. We divide the EllipBody shape parameters into two groups: part lengths and part thicknesses. The simplified parameters are shown in Table 1. The torso, feet, and hands remain asymmetric, as in the human body.
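As a concrete sketch of this parameterisation (the symbol names and the sampling scheme below are our own, not the paper's), one ellipsoid part can be generated from its center, global rotation, and three semi-principal axes:

```python
import numpy as np

def ellipsoid_points(center, R, semi_axes, n=16):
    """Sample surface points of one EllipBody part.

    center:    (3,) part centre c_i
    R:         (3, 3) global rotation of the part
    semi_axes: (l, a, b) - the skeleton semi-axis and the two shape semi-axes
    """
    u = np.linspace(0.0, 2.0 * np.pi, n)
    v = np.linspace(0.0, np.pi, n)
    uu, vv = np.meshgrid(u, v)
    l, a, b = semi_axes
    # A unit sphere scaled per axis; the first axis is aligned with the bone.
    pts = np.stack([
        l * np.cos(vv),
        a * np.sin(vv) * np.cos(uu),
        b * np.sin(vv) * np.sin(uu),
    ], axis=-1).reshape(-1, 3)
    return pts @ R.T + center

# An upper-arm-like part: long skeleton axis, thin shape axes.
part = ellipsoid_points(np.array([0.0, 0.5, 0.0]), np.eye(3), (0.15, 0.04, 0.04))
```

Adjusting one part means changing only its own five values, which is what makes per-part refinement straightforward in this representation.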

In the inference phase, we first use the ellipsoid parameters to reconstruct the EllipBody model and then use the pose parameters to recover the human pose with forward kinematics:

$p_j = p_{\mathrm{pa}(j)} + R_{\mathrm{pa}(j)} \, l_j \mathbf{o}_j$,

where $\mathrm{pa}(j)$ is the parent of joint $j$ and $\mathbf{o}_j$ is a unit offset vector indicating the direction from the parent to the current joint, so that $l_j \mathbf{o}_j$ denotes the local position of joint $j$ in its parent's coordinate frame. Similarly, we use the lengths and rotations to compute the centers of the ellipsoids; the only change is to modify the offset vectors so that they point to the part centers.
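The forward-kinematics recovery can be sketched as follows; the parent array, offset vectors, and per-joint rotation matrices are assumed inputs in our own notation, and joints are assumed to be ordered so that parents precede children:

```python
import numpy as np

def forward_kinematics(parents, offsets, lengths, rotations, root=np.zeros(3)):
    """Recover joint positions from EllipBody pose parameters.

    parents:   parent index per joint (-1 for the root); parents[j] < j
    offsets:   (J, 3) unit direction from parent joint to each joint
    lengths:   (J,) bone lengths
    rotations: (J, 3, 3) global rotation per joint
    """
    J = len(parents)
    pos = np.zeros((J, 3))
    for j in range(J):
        p = parents[j]
        if p < 0:
            pos[j] = root
        else:
            # p_j = p_parent + R_parent @ (l_j * o_j)
            pos[j] = pos[p] + rotations[p] @ (lengths[j] * offsets[j])
    return pos

# Toy 3-joint chain: hip -> knee -> ankle, both bones pointing straight down.
parents = [-1, 0, 1]
offsets = np.array([[0, 0, 0], [0, -1, 0], [0, -1, 0]], float)
lengths = np.array([0.0, 0.4, 0.4])
rotations = np.repeat(np.eye(3)[None], 3, axis=0)
joints = forward_kinematics(parents, offsets, lengths, rotations)
```

Computing ellipsoid centers instead of joints only requires replacing the offset vectors, exactly as described above.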

Part        Part
Ass         Upper legs
Abdomen     Lower legs
Chest       Feet
Neck        Upper arms
Shoulders   Fore arms
Head        Hands
Table 1: Shape Parameters of EllipBody Parts. Each listed part has one length parameter (the semi-axis along the bone) and thickness parameters (the two shape semi-axes).

The proposed model is an expansion of the 3D human skeleton and represents the pose and shape of body parts simultaneously. As shown in Figure 3, we extract specific end points and center points from the reconstructed EllipBody model as the 3D skeleton for pose, and we divide each ellipsoid into triangles to obtain a mesh for shape. For convenience, our implementation uses the icosahedron, a 20-face polyhedron whose faces are equilateral triangles and which can be subdivided to generate finer surfaces, as in the classic geodesic polyhedron [32].
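A minimal sketch of the surface-subdivision step, assuming midpoints are re-projected onto the unit sphere as in the geodesic-polyhedron construction. For brevity we demonstrate it on an octahedron rather than the icosahedron the paper uses; the routine itself is the same:

```python
import numpy as np

def subdivide(verts, faces):
    """One subdivision step: split each triangle into four,
    projecting new edge midpoints back onto the unit sphere."""
    verts = list(map(tuple, verts))
    cache = {}  # shared edges reuse the same midpoint vertex
    def midpoint(i, j):
        key = (min(i, j), max(i, j))
        if key not in cache:
            m = (np.array(verts[i]) + np.array(verts[j])) / 2.0
            m /= np.linalg.norm(m)          # keep the mesh on the sphere
            cache[key] = len(verts)
            verts.append(tuple(m))
        return cache[key]
    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (b, bc, ab), (c, ca, bc), (ab, bc, ca)]
    return np.array(verts), new_faces

# Octahedron stands in for the icosahedron; each step multiplies the face
# count by four (for a real icosahedron: 20 -> 80 -> 320 -> ...).
verts = np.array([[1,0,0], [-1,0,0], [0,1,0], [0,-1,0], [0,0,1], [0,0,-1]], float)
faces = [(0,2,4), (2,1,4), (1,3,4), (3,0,4), (2,0,5), (1,2,5), (3,1,5), (0,3,5)]
verts, faces = subdivide(verts, faces)
```

Scaling the resulting sphere mesh by the three semi-axes then yields an ellipsoid part at the chosen resolution.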

Figure 4: Part-level Differentiable Renderer. The left side illustrates all four possible coplanar cases of the approximate intensity gradients and their corresponding derivatives. The right side illustrates the vertical case, i.e. z-axis gradients taking effect over the rendered parts. The yellow pixel indicates where the rendering loss comes from.

3.2 PartDR: Part-Level Differentiable Renderer

Human part segmentation provides effective 3D evidence, e.g. boundaries, occlusions, and locations, for inferring the relationships between body parts. We extend the object-level differentiable neural renderer proposed by Kato et al. [13] to a part-level differentiable renderer (PartDR). PartDR draws body parts independently and generates both a face map and a part map. In back-propagation, we compute the part-level approximate derivatives following the original method [13] but omit regions occluded by other body parts. We also design a depth-aware occlusion loss to correct incorrectly occluded regions.

Rendering the human parts

Given the camera settings and EllipBody parameters, the rendering process produces two outputs: a face index map $F$ and a part index map $P$. The face index map indicates the correspondence between image pixels and the faces of the mesh: $F(p) = f_{ij}$ denotes that the nearest face projected onto image position $p$ is $f_{ij}$, the $j$-th face of the $i$-th part. The part index map is a binary array with $P_i(p) = 1$ if pixel $p$ belongs to the $i$-th part and $P_i(p) = 0$ otherwise:

$(F, P) = \mathcal{R}(M, \Pi)$,

where $M$ is the EllipBody model and $\Pi$ is the projection matrix.
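A toy illustration of how the two maps can be produced with a z-buffer. The axis-aligned rectangles standing in for projected faces, and all names, are our own simplification, not the actual PartDR rasterizer:

```python
import numpy as np

def render_part_maps(parts, H, W):
    """Rasterise boxes as stand-ins for projected triangle faces.

    parts: list of (part_id, face_id, depth, (x0, y0, x1, y1)) entries.
    Returns the face-index map F ((part, face) per pixel, -1 when empty)
    and the binary part-index maps P.
    """
    n_parts = 1 + max(p for p, _, _, _ in parts)
    zbuf = np.full((H, W), np.inf)
    F = np.full((H, W, 2), -1, dtype=int)       # (part id, face id)
    P = np.zeros((n_parts, H, W), dtype=int)
    for part_id, face_id, depth, (x0, y0, x1, y1) in parts:
        region = zbuf[y0:y1, x0:x1]
        closer = depth < region                 # keep only the nearest face
        region[closer] = depth
        F[y0:y1, x0:x1][closer] = (part_id, face_id)
    for i in range(n_parts):
        P[i] = (F[..., 0] == i).astype(int)
    return F, P

# Torso (part 0) at depth 2, partially occluded by a forearm (part 1) at depth 1.
boxes = [(0, 0, 2.0, (1, 1, 6, 6)), (1, 0, 1.0, (4, 4, 8, 8))]
F, P = render_part_maps(boxes, 8, 8)
```

In the overlap region the forearm wins the depth test, so $P$ records the occlusion order that the depth-aware loss later checks against the target segmentation.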

Approximate derivatives for part rendering

In the training process, we follow the neural renderer approach proposed in [13] to compute approximate derivatives for each part. The neural renderer is a differentiable renderer that uses an approximate gradient of rasterization to enable end-to-end rendering in a neural network; it efficiently approximates the gradients of vertex coordinates with respect to the rendered image.

We write $I(p)$ for the rendering function at pixel $p$ and give the derivative with respect to the $x$-coordinate $x_v$ of the $v$-th vertex of face $f_{ij}$ as

$\dfrac{\partial I(p)}{\partial x_v} = \dfrac{\delta P}{x_1 - x_v}$,

where we show only the $x$-axis derivative for simplicity. $I(p)$ is the rendered value of pixel $p$, $\delta P$ is the residual between the groundtruth $P$ and $I(p)$, $x_v$ is the x-coordinate of the current vertex, and $x_1$ is the new x-coordinate at which $p$ collides with the edge of the rendered face.

The neural renderer can be applied to a single ellipsoid. However, EllipBody contains multiple ellipsoids, which may lead to inaccurate approximations due to self-occlusion. We therefore omit the self-occluded region, shown as the red triangle in Figure 4, from the derivative approximation.


We propose derivatives on the -axis (direction on depth) as an extension for the part-level neural renderer. We omit the derivatives in the occluded regions and then design a new approximation of the derivatives on the -axis to refine the incorrectly occluded part. As shown in Figure 4, we first find the occluded face. Then we compute the depth derivatives directly proportional to the distance between the occluded point and the one occlude it. The derivative is shown as


is the distance between the two faces. is the length between two points. is the corresponding point whose projecting point is . The line form to intersects to at is a variable to magnitude the term.
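The intuition behind the depth-aware term can be illustrated with a toy sketch (our own simplification, not the exact derivative): pixels where the rendered front part disagrees with the target segmentation receive a signal proportional to the depth gap between the occluding and occluded faces:

```python
import numpy as np

def depth_aware_signal(pred_part, target_part, z_front, z_back, lam=1.0):
    """Pseudo-gradient on depth for incorrectly occluded pixels.

    pred_part / target_part: per-pixel part ids of the currently rendered
    front surface and of the target segmentation.
    z_front / z_back: per-pixel depths of the occluding and occluded faces.
    Returns a map that is non-zero only where the wrong part is in front,
    with magnitude proportional to the depth gap between the two faces.
    """
    wrong = (pred_part != target_part) & (target_part >= 0)
    gap = np.clip(z_back - z_front, 0.0, None)   # distance between the faces
    return lam * wrong * gap

pred = np.array([[1, 1], [0, 0]])      # forearm rendered in front on row 0
target = np.array([[0, 1], [0, 0]])    # but pixel (0, 0) should show the torso
grad = depth_aware_signal(pred, target,
                          z_front=np.full((2, 2), 1.0),
                          z_back=np.full((2, 2), 2.0))
```

The larger the depth gap, the stronger the pull on the wrongly occluding face, which matches the proportionality stated above.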

3.3 EllipBody Estimation

We propose an end-to-end pipeline to estimate the parameters of EllipBody. As shown in Figure 2, a CNN backbone first extracts features from a single image. Based on the image features, we regress the pose and shape parameters. After that, we optimize an objective function that minimizes the part segmentation loss of the rendered parts.

Network Design.

As previous works have had great success training deep CNNs for human pose estimation, we take a simple baseline as our encoder to extract features from the image. The features are fed into a regression block with a structure similar to [20]. We obtain the pose rotations, part lengths, and part thicknesses; the pose is predicted as local rotations, from which we compute the global rotations by forward kinematics [16, 17] over EllipBody. Note that the raw network outputs may not be valid rotations, so we apply the Gram–Schmidt process [6] to guarantee validity. The network also regresses the camera parameters of the weak-perspective model proposed by Kanazawa et al. [12]. The skeleton and the mesh vertices can then be calculated from the predicted parameters as described in Section 3.1. Given the camera parameters and the EllipBody, PartDR outputs part maps as the predicted part segmentation, and we project the 3D skeleton to obtain the 2D keypoints.
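A minimal sketch of the Gram–Schmidt projection used to turn a raw network output into a valid rotation; the input matrix here is an arbitrary example of our own:

```python
import numpy as np

def gram_schmidt_rotation(m):
    """Project a raw 3x3 network output onto a valid rotation matrix
    by orthonormalising its first two columns (Gram-Schmidt)."""
    a1, a2 = m[:, 0], m[:, 1]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - (b1 @ a2) * b1          # remove the component along b1
    b2 /= np.linalg.norm(b2)
    b3 = np.cross(b1, b2)             # third axis from the cross product
    return np.stack([b1, b2, b3], axis=1)

raw = np.array([[0.9, 0.1, 0.0],
                [0.2, 1.1, 0.0],
                [0.0, 0.0, 0.8]])
R = gram_schmidt_rotation(raw)
```

Because the third axis comes from a cross product of the first two, the result is always a proper right-handed rotation, which is why this projection is a popular way to constrain rotation regression.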

The loss function is composed of three terms: a reconstruction loss on 3D joints, a projection loss on 2D keypoints, and the part segmentation loss produced by PartDR. We integrate 2D annotations of in-the-wild images as weak supervision:

$L = \lambda_{3D} L_{3D} + \lambda_{2D} L_{2D} + \lambda_{seg} L_{seg}$,

where $\lambda_{3D}$, $\lambda_{2D}$, and $\lambda_{seg}$ are the weights of each loss. We set $\lambda_{3D} = 0$ for images that only have 2D annotations.
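A simple sketch of this weighted objective, with mean-squared errors standing in for the actual loss terms and weight names of our own choosing:

```python
import numpy as np

def total_loss(j3d_pred, j3d_gt, k2d_pred, k2d_gt, seg_pred, seg_gt,
               w3d=1.0, w2d=1.0, wseg=1.0, has_3d=True):
    """Weighted sum of the three training losses; the 3D term is
    switched off for images that only have 2D annotations."""
    l3d = np.mean((j3d_pred - j3d_gt) ** 2) if has_3d else 0.0
    l2d = np.mean((k2d_pred - k2d_gt) ** 2)
    lseg = np.mean((seg_pred - seg_gt) ** 2)   # stand-in for the part-seg loss
    return (w3d * l3d if has_3d else 0.0) + w2d * l2d + wseg * lseg

# With only 2D labels the 3D term contributes nothing.
loss_full = total_loss(np.ones((4, 3)), np.zeros((4, 3)),
                       np.zeros((4, 2)), np.zeros((4, 2)),
                       np.zeros((2, 8, 8)), np.zeros((2, 8, 8)))
loss_2d_only = total_loss(np.ones((4, 3)), np.zeros((4, 3)),
                          np.zeros((4, 2)), np.zeros((4, 2)),
                          np.zeros((2, 8, 8)), np.zeros((2, 8, 8)),
                          has_3d=False)
```

This is the usual mixed-supervision recipe: in-the-wild batches train only the 2D and segmentation terms, while mocap batches also train the 3D term.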

Optimization with Part Segmentation.

Previous methods have shown the importance of optimization after the network prediction, and we adopt such a procedure to refine the EllipBody estimate. Moreover, since EllipBody is a part-based model, we can perform the optimization directly on part segmentation data. We formulate the objective of the optimization as

$\min_{\Theta, \pi} \; L_{2D} + L_{seg} + R_{\theta} + R_{s}$,

where $\Theta$ denotes the parameters of EllipBody and $\pi$ the weak-perspective camera settings; $L_{2D}$ and $L_{seg}$ are the same losses as in the training objective, and $R_{\theta}$ and $R_{s}$ are regularization terms.

We employ the part segmentation predicted by [21] as the target. Since the network provides accurate initial parameters, the optimization can improve the joint positions, the depth order of the joints, and the body shape simultaneously within a small number of iterations.

3.4 Convert EllipBody to SMPL

To visualize a detailed body model, we train a Multi-Layer Perceptron (MLP) to convert the pose of EllipBody to the pose parameters of SMPL [19]. We first convert the rotation vectors of both models to rotation matrices through the Rodrigues formula; the loss function is the difference between the MLP output and the corresponding SMPL rotation matrices.


After that, we perform Iterative Closest Point (ICP) to obtain the rotation and translation between the two models. The objective function is

$\sum_i \mathrm{ICP}(V_i^{E}, V_i^{S})$,

where $\mathrm{ICP}(\cdot)$ is the Iterative Closest Point process [5], $V_i^{E}$ are the vertices in the $i$-th part of EllipBody, and $V_i^{S}$ are the vertices in the corresponding part of the SMPL model.
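One closed-form alignment step of the kind ICP iterates (the Kabsch solution for point sets already in correspondence) can be sketched as follows; the toy point cloud and all names are our own:

```python
import numpy as np

def rigid_align(src, dst):
    """One closed-form ICP-style step (Kabsch): find the rotation R and
    translation t minimising ||R @ src_i + t - dst_i|| over corresponded
    point sets."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)            # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])               # guard against reflections
    R = Vt.T @ D @ U.T
    t = cd - R @ cs
    return R, t

# Recover a known rotation/translation of a toy part point cloud.
rng = np.random.default_rng(0)
src = rng.normal(size=(30, 3))
theta = np.pi / 5
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.2, -0.1, 0.3])
dst = src @ R_true.T + t_true
R_est, t_est = rigid_align(src, dst)
```

Full ICP alternates this step with re-estimating correspondences by nearest neighbour; running it per part, as in the objective above, aligns each EllipBody part with its SMPL counterpart.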

4 Experiments

4.1 Datasets


Human3.6M [10]. It is a large-scale human pose dataset containing complete motion capture data, together with images, camera settings, part segmentation, and depth maps. We use the original mocap pose data for EllipBody and merge its body part segmentation into 14 parts. We use subjects S1, S5, S6, S7, and S8 as training data and test on S9 and S11. We employ two popular error metrics, Mean Per Joint Position Error (MPJPE) and Reconstruction Error (PA-MPJPE), for evaluation.

Rec. Error
Akhter & Black [1] 181.1
Ramakrishna et al. [27] 157.3
Zhou et al. [34] 106.7
SMPLify [4] 82.3
Lassner et al. [15] 80.7
Pavlakos et al. [25] 75.9
NBF [21] 59.9
HMR [12] 58.1
GraphCMR [14] 51.9
Ours 51.4
Ours+Optimization 47.6
Table 2: Detailed results on Human3.6M [10]. Numbers are reconstruction errors (mm) for 17 joints, also known as PA-MPJPE. The numbers are taken from the respective papers.
FB Seg. Part Seg.
acc. f1 acc. f1
SMPLify on GT [4] 92.17 0.88 88.82 0.67
SMPLify [4] 91.89 0.88 87.71 0.64
SMPLify on [26] 92.17 0.88 88.24 0.64
HMR [12] 91.67 0.87 87.12 0.60
Bodynet [31] 92.75 0.84 - -
GraphCMR [14] 91.46 0.87 88.69 0.66
PartDR+SMPL on GT 94.03 0.91 91.91 0.79
PartDR+EllipBody on GT 94.74 0.92 93.26 0.84
PartDR+EllipBody+Pred.Part 92.13 0.88 90.70 0.74
Table 3: Segmentation evaluation on the LSP test set. The numbers are accuracy and f1 scores. Both regression-based and optimization-based approaches are included. Our approach optimized with part segmentation reaches the state of the art.


UP-3D [15]. It is an in-the-wild dataset that collects high-quality samples from MPII [2], LSP [11], and FashionPose [7]. There are 8515 images, 7818 for training and 1389 for testing. Each image has corresponding SMPL parameters, and we render each part of the SMPL models as groundtruth.


LSP [11]. It is a 2D pose dataset that also provides part segmentation annotations. We use its test set to evaluate the accuracy and f1 score of part segmentation.

Model (supervision) MPJPE (mm)
3D Joints 104.5
SMPL (full) 75.9
SMPL (part) 67.1
EllipBody 73.8
EllipBody (full) 67.1
EllipBody (seg) 65.2
EllipBody (full) 64.1
EllipBody (part) 62.8
Table 4: Comparison on Human3.6M of different parametric models with and without segmentation losses. The metric is MPJPE in millimeters.

4.2 Implementation Details

To train the regression network, we adopt the backbone pretrained by Xiao et al. [33]. The dimension of the regression model is 1024, and each regressor is stacked from two residual blocks. The output segmentation has the same size as the input image. We use the Adam optimizer with a batch size of 128 and train the model for 70 epochs without the segmentation loss; we then add the segmentation loss, reduce the learning rate, and train for an additional 30 epochs. When optimizing the EllipBody predicted by the network, we again use the Adam optimizer, with at most 50 iterations. The loss weights are set based on the experimental results. The target segmentation is predicted by the RefineNet used by Omran et al. [21].

4.3 Comparing with the State-of-the-Arts.

3D Pose Estimation.

Figure 5: Optimization Performance on the LSP Dataset. The red line shows forward time consumption, while the blue line shows part segmentation accuracy. We mark the number of faces for our models and for SMPL; our models with zero to four repetitions of surface subdivision are denoted accordingly.
Figure 6: Qualitative Results. Human shape recovery on the LSP dataset. Images from left to right: 1) original image, 2) GraphCMR [14] mesh, 3) EllipBody part segmentation, 4) EllipBody mesh, 5) detailed model from EllipBody.

We compare our approach with other state-of-the-art methods for 3D pose estimation on Human3.6M. The results are presented in Table 2. The 3D pose predicted by the network alone is competitive with the baselines, and after optimization the Reconstruction Error decreases further, benefiting from reliable body part segmentation. We note that different methods use different annotations: Kolotouros et al. [14] use the 3D SMPL meshes of Human3.6M and the UP-3D dataset as supervision; Kanazawa et al. [12] use additional images with 2D keypoint annotations; Omran et al. [21] train only on Human3.6M, while Pavlakos et al. [25] do not use any Human3.6M data. Our method employs additional part segmentation annotations from Human3.6M and UP-3D.

Part Segmentation.

To evaluate shape recovery, we compare with previous works on the LSP test set using part segmentation metrics, as shown in Table 3; both learning-based and optimization-based methods are listed. We first use SMPL as the representation and optimize against the groundtruth, which raises accuracy on both foreground/background segmentation and part segmentation. Switching the model to EllipBody raises part segmentation accuracy further. These results show that our method narrows the gap between full-body segmentation and part segmentation performance. We then optimize EllipBody against the part segmentation predicted by [21]: ours outperforms the other methods, and the full-body segmentation is competitive with BodyNet [31].

4.4 Ablative Study

Effectiveness of EllipBody and Part-Level Differentiable Renderer.

We investigate the effectiveness of our light-weight EllipBody model for 3D pose estimation in the network. To this end, we compare EllipBody with the popular parametric model SMPL and use 3D joint positions alone as the baseline. As shown in Table 4, both EllipBody and SMPL perform better than the baseline because they embed body priors. When using only 2D annotations (2D keypoints and segmentation), EllipBody outperforms SMPL in MPJPE on Human3.6M Protocol 1. Note that with SMPL, the 3D joints are regressed only after the whole mesh is determined, so the joint positions depend on the shape parameters; EllipBody, with explicit bone length parameters, infers the skeleton and the body mesh separately.

Beyond the choice of model, we verify the insight that part segmentation is a 2D annotation carrying 3D information by comparing full-body silhouettes and part silhouettes as supervision. Even when applying only full-body segmentation, MPJPE decreases with EllipBody. When applying part segmentation, the error decreases for both SMPL and EllipBody, and the results are close to those obtained by adding 3D annotations.

Optimization Performance

Figure 5 shows the performance of different body model configurations over the optimization process. Since EllipBody can be subdivided to increase the number of faces, we explore how the number of faces influences optimization. We find that fewer faces significantly speed up the fitting process. However, part segmentation accuracy on the LSP test set stops increasing once the area of a single face falls below one pixel at the output resolution of PartDR.

4.5 Qualitative Evaluation

Figure 6 shows qualitative results compared with one of the state-of-the-art methods [14]. With part segmentation, the human body predicted from a single image has a more accurate pose. Figure 7 shows EllipBody and SMPL with different body proportions. Although both models work well and can be converted to each other, the parameters of EllipBody are interpretable thanks to the explicit meaning of part lengths and thicknesses.

Figure 7: SMPL with EllipBody. Thanks to the high flexibility of EllipBody's parameters, we can effortlessly control the somatotype and convert it into other realistic body shapes.

5 Conclusion

In this paper, we present an approach that uses part segmentation as supervision to improve human pose and shape recovery. To this end, we propose a light-weight, part-based human model that generates the skeleton and the shape of body parts efficiently. We also extend a differentiable mesh renderer to the part level so that it can recognize occlusions between body parts. The proposed methods improve both precision and speed for training-based and optimization-based approaches.


  • [1] I. Akhter and M. J. Black (2015) Pose-conditioned joint angle limits for 3d human pose reconstruction. In CVPR, pp. 1446–1455. Cited by: Table 2.
  • [2] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele (2014) 2d human pose estimation: new benchmark and state of the art analysis. In CVPR, pp. 3686–3693. Cited by: §4.1.
  • [3] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis (2005) SCAPE: shape completion and animation of people. In ACM Transactions on Graphics (TOG), Cited by: §2.3.
  • [4] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black (2016) Keep it smpl: automatic estimation of 3d human pose and shape from a single image. In ECCV, pp. 561–578. Cited by: §1, §2.3, Table 2, Table 3.
  • [5] D. Chetverikov, D. Svirko, D. Stepanov, and P. Krsek (2002) The trimmed iterative closest point algorithm. In Object Recognition Supported by User Interaction for Service Robots, Vol. 3, pp. 545–548. Cited by: §3.4.
  • [6] J. W. Daniel, W. B. Gragg, L. Kaufman, and G. W. Stewart (1976) Reorthogonalization and stable algorithms for updating the gram-schmidt qr factorization. Mathematics of Computation 30 (136), pp. 772–795. Cited by: §3.3.
  • [7] M. Dantone, J. Gall, C. Leistner, and L. Van Gool (2014) Body parts dependent joint regressors for human pose estimation in still images. TPAMI 36 (11), pp. 2131–2143. Cited by: §4.1.
  • [8] M. de La Gorce, N. Paragios, and D. J. Fleet (2008) Model-based hand tracking with texture, shading and self-occlusions. In CVPR, pp. 1–8. Cited by: §2.2.
  • [9] P. Guan, A. Weiss, A. O. Balan, and M. J. Black (2009) Estimating human shape and pose from a single image. In ICCV, pp. 1381–1388. Cited by: §2.3.
  • [10] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2014) Human3.6M: large scale datasets and predictive methods for 3d human sensing in natural environments. TPAMI 36 (7), pp. 1325–1339. Cited by: §1, Table 2.
  • [11] S. Johnson and M. Everingham (2010) Clustered pose and nonlinear appearance models for human pose estimation. In BMVC, Note: doi:10.5244/C.24.12 Cited by: §1, §4.1.
  • [12] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik (2018-06) End-to-end recovery of human shape and pose. In CVPR, Cited by: §1, §2.1, §2.3, §3.3, §4.3, Table 2, Table 3.
  • [13] H. Kato, Y. Ushiku, and T. Harada (2018) Neural 3d mesh renderer. In CVPR, Cited by: §1, §2.2, §3.2, §3.2.
  • [14] N. Kolotouros, G. Pavlakos, and K. Daniilidis (2019) Convolutional mesh regression for single-image human shape reconstruction. In CVPR, pp. 4501–4510. Cited by: §2.3, Figure 6, §4.3, §4.5, Table 2, Table 3.
  • [15] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler (2017) Unite the people: closing the loop between 3d and 2d human representations. In CVPR, Vol. 2, pp. 3. Cited by: §1, §2.3, Table 2.
  • [16] K. Lee and D. K. Shah (1988) Kinematic analysis of a three-degrees-of-freedom in-parallel actuated manipulator. IEEE Journal on Robotics and Automation 4 (3), pp. 354–360. Cited by: §3.3.
  • [17] K. Liu, J. M. Fitzgerald, and F. L. Lewis (1993) Kinematic analysis of a stewart platform manipulator. IEEE Transactions on Industrial Electronics 40 (2), pp. 282–293. Cited by: §3.3.
  • [18] M. M. Loper and M. J. Black (2014) OpenDR: an approximate differentiable renderer. In ECCV, pp. 154–169. Cited by: §2.2.
  • [19] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM Transactions on Graphics (TOG) 34 (6), pp. 248. Cited by: §1, §1, §2.1, §2.3, §3.4.
  • [20] J. Martinez, R. Hossain, J. Romero, and J. J. Little (2017) A simple yet effective baseline for 3d human pose estimation. In ICCV, Cited by: §2.1, §3.3.
  • [21] M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, and B. Schiele (2018) Neural body fitting: unifying deep learning and model based human pose and shape estimation. In 3DV, pp. 484–494. Cited by: §2.2, §3.3, §4.2, §4.3, §4.3, Table 2.
  • [22] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black (2019) Expressive body capture: 3d hands, face, and body from a single image. In CVPR, Cited by: §1, §2.1, §2.3.
  • [23] G. Pavlakos, X. Zhou, and K. Daniilidis (2018) Ordinal depth supervision for 3d human pose estimation. In CVPR, pp. 7307–7316. Cited by: §2.1.
  • [24] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis (2017) Coarse-to-fine volumetric prediction for single-image 3d human pose. In CVPR, pp. 7025–7034. Cited by: §2.1.
  • [25] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis (2018-06) Learning to estimate 3d human pose and shape from a single color image. In CVPR, Cited by: §2.3, §4.3, Table 2.
  • [26] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis (2018) Learning to estimate 3d human pose and shape from a single color image. In CVPR, pp. 459–468. Cited by: §1, §2.2, Table 3.
  • [27] V. Ramakrishna, T. Kanade, and Y. Sheikh (2012) Reconstructing 3d human pose from 2d image landmarks. In ECCV, pp. 573–586. Cited by: Table 2.
  • [28] M. Straka, S. Hauswiesner, M. Rüther, and H. Bischof (2011) Skeletal graph based human pose estimation in real-time.. In BMVC, pp. 1–12. Cited by: §2.1.
  • [29] X. Sun, J. Shang, S. Liang, and Y. Wei (2017) Compositional human pose regression. In ICCV, pp. 2602–2611. Cited by: §2.1.
  • [30] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei (2018) Integral human pose regression. In ECCV, pp. 529–545. Cited by: §2.1.
  • [31] G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid (2018) BodyNet: volumetric inference of 3d human body shapes. In ECCV, pp. 20–36. Cited by: §2.1, §4.3, Table 3.
  • [32] M. J. Wenninger (1974) Polyhedron models. Cambridge University Press. Cited by: §3.1.
  • [33] B. Xiao, H. Wu, and Y. Wei (2018) Simple baselines for human pose estimation and tracking. In ECCV, Cited by: §4.2.
  • [34] X. Zhou, M. Zhu, S. Leonardos, and K. Daniilidis (2017) Sparse representation for 3d shape estimation: a convex relaxation approach. TPAMI 39 (8), pp. 1648–1661. Cited by: Table 2.