SMPLR: Deep SMPL reverse for 3D human pose and shape recovery

12/27/2018 ∙ by Meysam Madadi, et al. ∙ 8

A recent trend in 3D human pose and shape estimation is to use deep learning and statistical morphable body models, such as the parametric Skinned Multi-Person Linear Model (SMPL). However, despite the advantage of providing both body pose and shape, SMPL-based solutions have had difficulty achieving accurate predictions. This is due to the unconstrained nature of SMPL, which allows unrealistic poses and shapes, hindering its direct regression or its use in the training of deep models. In this paper we propose to embed SMPL within a deep model to efficiently estimate 3D pose and shape from a still RGB image. We use 3D joints as an intermediate representation to regress SMPL parameters, which are in turn used to recover the joints as SMPL output. This module can be seen as an autoencoder where the encoder is modeled by deep neural networks and the decoder is SMPL. We refer to this procedure as SMPL reverse (SMPLR). The input 3D joints can then be estimated by any convolutional neural network (CNN). To handle input noise to SMPLR, we propose a denoising autoencoder between the CNN and SMPLR which is able to recover structured error. We evaluate our method on the SURREAL and Human3.6M datasets, showing significant improvement over SMPL-based state-of-the-art alternatives.




1 Introduction

3D human pose estimation from still RGB images remains an open challenge. It needs to deal with changes in lighting conditions, cluttered backgrounds, occlusions, inter- and intra-subject pose variability, as well as ill-posed depth ambiguity. Because of this, accurate annotation of captured data is not a trivial task and most available datasets are captured in controlled environments [5, 20, 12].

Figure 1: Illustration of the proposed model output. Given an input RGB image, an initial estimate of 3D joint locations is obtained with a volumetric stacked hourglass network [14]. Our denoising autoencoder is able to recover the pose from structured error. Finally, the body mesh is rendered by SMPL based on the proposed SMPL reverse strategy. Green lines show the ground truth and red lines show the estimated 3D joints.

Estimation of 3D joint locations has a number of applications in computer vision, including human-robot interaction, gaming and sports performance analysis, to mention just a few. However, 3D joints do not by themselves convey the anatomy or shape of the body. Being able to estimate body shape or mesh along with joints widens the applicability of a method to tasks such as movie editing, soft-biometric body measurement or cloth retexturing, among others. Recently, researchers have started to build deep models able to estimate body mesh [18, 6, 16], enabled by realistic morphable models such as the Skinned Multi-Person Linear Model (SMPL) [8]. While early SMPL-based solutions [1, 7] use regularization to constrain gradients, recent works apply adversarial training [6] or directly estimate body volume in multi-task CNNs [26].

In this paper, we propose a deep neural network inspired by the SMPL model to estimate 3D human pose and mesh from a still RGB image. SMPL is a parametric model that renders a realistic mesh based on PCA shape components along with relative axis-angle rotations of joints, computed with respect to their parent node in a defined kinematic tree. Instead of directly regressing SMPL pose and shape parameters from the input image, which may generate undesirable artifacts on unseen data [6], we propose to regress these parameters from an intermediate set of representations. Unlike previous works [18, 16], which use 2D joints or part segmentation as intermediate representations, we use 3D joints and demonstrate that the previous representations are suboptimal for pose and shape regression. We claim that such a regression does not implicitly handle input noise, causing an offset error embedded in the model. Therefore, we also design a denoising autoencoder network [28] as an extra module able to handle structured error.

In more detail, given a CNN, we estimate a set of 3D joints and landmarks which are fed into a denoising autoencoder (DAE) network with skip connections. The DAE acts as a bridge between the CNN and the SMPL module. In the final module, SMPL parameters are regressed. We call this procedure SMPL reverse (SMPLR). One can think of it as an encoder-decoder module whose intermediate embeddings are pose and shape components. We define the encoder as a number of multi-layer perceptron (MLP) networks, while the decoder is SMPL. We note that, with the DAE, the gradients coming from SMPLR are smoothed, allowing more stable end-to-end training. Additionally, given any 2D pose estimation network, we are able to lift 2D joints to 3D and feed them to SMPLR. We show the proposed model output in Fig. 1.

In summary, our main contributions are as follows:

  • We build a denoising autoencoder that learns structured error from input data. The model transforms 3D joint predictions into a more human-consistent pose, enforcing symmetry and proportions on bone lengths.

  • We analyze the SMPL model and design two MLP networks to regress SMPL parameters from the 3D joints given by the DAE. This allows the inference of a human body mesh from a sparse point representation. We gain an improvement over the chosen CNN, a stacked hourglass network (SHN), by end-to-end training with SMPL, since the SHN can benefit from these models through backpropagation of errors.

  • Throughout our experiments, we demonstrate that it is possible to obtain an accurate human body model from a set of joint and landmark predictions. Results also show how SHN estimations can benefit from these MLPs and SMPL in end-to-end training. We obtain state-of-the-art results for SMPL-like architectures on the H3.6M [5] and SURREAL [27] datasets. (Footnote 1: Code will be made publicly available after publication of the paper.)

2 Related work

In this section we review state-of-the-art papers for 3D human pose estimation from single RGB images.

Lifting 2D to 3D. Depth regression given 2D joints has been an active research topic in 3D human pose recovery. Nevertheless, this is an ill-posed problem in which several 3D poses can project to the same 2D joints. Chen and Ramanan [3] show that copying depth from 3D mocap data can provide a fair estimate when a nearest 2D match is given. However, Moreno [13] shows that the distance between random pairs of poses is more ambiguous in Cartesian space than in a Euclidean distance matrix representation. Recent works show that directly using simple [11] or cascaded [4] MLP networks can be more accurate. Additionally, 2D joints can be noisy and include wrong estimations, making the previous solutions suboptimal. In this regard, Yang et al. [29] use adversarial training and benefit from available 3D data along with 2D data to infer depth information. In our case, the proposed denoising autoencoder is used to lift 2D pose to 3D in the absence of 3D ground truth data, providing accurate input to SMPLR.

Direct regression. This refers to regressing 3D pose directly from the RGB image. Due to the nonlinear nature of human pose, 3D pose regression without modeling the correlation of joints is not a trivial task. Brau and Jiang [2] estimate 3D joints and camera parameters without direct supervision on them. Instead, they use several loss functions on projected 2D joints and bone sizes, together with independent Fisher priors. Sun et al. [23] propose a compositional loss function based on relative joints with respect to a defined kinematic tree. They separate 2D joints and depth estimation in the loss.

Probability maps. These are probability vectors assigned to each pixel/voxel such that a likelihood is computed for each joint. In this way, 2D heatmaps have visual evidence in the image, and different solutions are applied to infer the third dimension. For instance, Tome et al. [25] combine 2D belief maps with 3D mocap data in an iterative way. They train a probabilistic PCA to learn the distribution of 3D mocap data. Then, at each iteration, they minimize a function based on the 2D belief map and the learned 3D model to find the most probable 3D pose. Since probability maps are dense predictions, most approaches apply fully convolutional networks. Luo et al. [9] extend the stacked hourglass network (SHN) to estimate 2D joint heatmaps along with a limb orientation map in each stack. Pavlakos et al. [17] extend SHN to output 3D volumetric data with a coarse-to-fine architecture in which the resolution of the third dimension is increased linearly at each stack alongside the 2D heatmaps. Following this trend, we extend SHN to volumetric heatmaps with a fixed depth resolution of 16 and show that it works better than coarse-to-fine approaches. Recently, Nibali et al. [15] propose marginal heatmaps from different axis viewpoints. They compute 3D joints based on soft-argmax and apply multiple loss functions on the heatmaps and 3D joints.

Figure 2: System pipeline. We extend stacked hourglass network to volumetric heatmaps and feed estimated joints to denoising autoencoder module. We use soft argmax, thus, gradients can be backpropagated. Finally, two independent networks are designed to regress SMPL parameters which are used to render realistic body mesh.

Pose and shape estimation. Given a detailed body model, it is common to estimate body volume or mesh along with 3D pose. Early models were mainly based on combinations of simple geometric objects [22, 21], while recent models are PCA-based parametric models like SMPL [8]. Bogo et al. [1] were the first to apply SMPL to body pose and shape recovery. Their method was based on a standard optimization procedure with an objective function minimizing the distance between projected joints and pre-estimated 2D joints. They applied several constraints, including pose angle and shape priors. Lassner et al. [7] extended this work by including a bi-directional distance between the projected mesh and the body silhouette. Later, researchers started to embed SMPL within deep models. Kanazawa et al. [6] regressed SMPL pose and shape along with camera parameters using adversarial training, in which the generated mesh is given to a discriminator network that classifies bodies as real or fake with respect to real body scans. Similar to our work, [18, 16] estimate pose and shape parameters from intermediate information, such as body segments [16] or 2D joint heatmaps and body mask [18], and include SMPL to render 3D joints/mesh which are given to a loss function. However, these representations are ill-posed and suboptimal for regressing SMPL parameters. We show that 3D joints deal with this problem better, significantly outperforming the aforementioned solutions. Recently, Varol et al. [26] proposed a multi-task approach to estimate 2D/3D body pose, pixel segments and volumetric shape. They use SMPL to generate ground truth body volumes and do not embed the SMPL function within the network.

In this paper, we propose an approach to estimate 3D body pose and shape using intermediate information and the SMPL model. Given ground truth 3D data, we can benefit from end-to-end training to improve the CNN. Besides, in the absence of such ground truth data, our method can be adapted to 2D-to-3D lifting for in-the-wild scenarios.

3 Methodology

We estimate 3D human joints and shape (body mesh) from a single RGB image, each represented as a set of 3D points: one per body joint and one per mesh point, respectively. In order to compute a detailed mesh, we use the SMPL model [8], following recent trends in human pose and shape recovery. The goal is to estimate SMPL parameters from the image using deep learning without directly regressing them. This way, we avoid possible artifacts while keeping the architecture flexible in the absence or presence of ground truth data. Our network contains three main modules, shown in Fig. 2. Next, we explain each module in detail.

3.1 SMPL review

SMPL is a statistical parametric function which maps shape parameters and axis-angle pose parameters to mesh vertices, given learned model parameters. Given a template average body shape and a dataset of scanned bodies, two sets of PCA coefficients, the shape bases and the pose bases, are learned to form part of the model parameters, with the pelvis taken as the root joint. The template shape can then be updated by adding the displacement offsets computed from the shape and pose parameters and their respective bases. The pose bases are responsible for small pose-based displacements due to body soft-tissue behavior and contribute little to the shape deformation. Given the kinematic tree of the joints, a relative rotation matrix is computed for each joint with respect to its parent in the kinematic tree. Each rotation matrix is a function of the pose parameters and is computed with the Rodrigues formula. These rotation matrices are used for two purposes: i) to update the template shape in rest pose through the pose bases, and ii) to form the body pose with standard rendering techniques in which body parts are rotated relative to one another along the kinematic tree. The rendering must be applied to the updated joints. Therefore, the joints of the template are updated beforehand by a regressor function (part of the model parameters) applied to the updated vertices.

The SMPL model has several characteristics. First, it is a differentiable function, which makes it possible to use it together with deep networks. However, it does not implicitly constrain invalid pose and shape values and, thus, it is a many-to-one function. This means that, given an RGB image, end-to-end training of the SMPL model from scratch on top of a CNN may converge to a non-optimal solution. One of the reasons is the use of the Rodrigues formula and axis-angle pose parameters, a representation known not to be unique. A possible solution is to use rotation matrices directly, as proposed in [7, 16]. Second, SMPL is a generative model which allows us to generate millions of mocap samples for free. It can even be used to generate realistic synthetic images [27].
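The non-uniqueness of the axis-angle representation mentioned above can be illustrated with a short sketch of the Rodrigues rotation formula. This is a generic NumPy illustration, not the SMPL implementation; the function name `rodrigues` and the example angles are assumptions chosen for the demonstration. Two different axis-angle vectors (angles that differ by a full turn about the same axis) map to the same rotation matrix, which is one reason the mapping from pose parameters to output is many-to-one.

```python
import numpy as np

def rodrigues(axis_angle):
    """Convert an axis-angle vector of shape (3,) to a 3x3 rotation matrix."""
    axis_angle = np.asarray(axis_angle, dtype=float)
    angle = np.linalg.norm(axis_angle)
    if angle < 1e-8:
        return np.eye(3)  # no rotation
    k = axis_angle / angle  # unit rotation axis
    # Skew-symmetric cross-product matrix of the axis.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

axis = np.array([0.0, 0.0, 1.0])
r1 = rodrigues(0.5 * axis)                 # rotate 0.5 rad about z
r2 = rodrigues((0.5 + 2 * np.pi) * axis)   # same rotation, different parameters
same_rotation = np.allclose(r1, r2)
```

Regressing rotation matrices directly, as the text suggests, sidesteps this particular ambiguity because each rotation has a single matrix.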

3.2 SMPL reverse

A natural way of embedding SMPL in a deep network is to estimate the pose and shape parameters from the image and feed them to SMPL. However, due to the limitations described above, this is a challenging task. Besides, direct regression of SMPL parameters may generate artifacts [6, 24]. Instead, researchers use intermediate representations such as 2D joints and body silhouette [18] or body segments [16] to regress SMPL parameters. Although such representations are easier to annotate from RGB images than ground truth SMPL data, they provide a sub-optimal mapping to SMPL parameters. In this paper, we instead propose an autoencoder-like scheme, i.e. the input and output of the system are identical, in which SMPL parameters are obtained by the encoder and SMPL is taken as the decoder. We refer to this model as SMPL reverse (SMPLR).

We model the SMPLR encoder with deep MLP networks. We design two independent networks with the same architecture, one for pose and one for shape estimation, each with its own network parameters. Since we define SMPLR as an autoencoder, the input to the system must naturally be the joints and the mesh. However, the mesh is a high-dimensional vector and not all of its points contribute equally to the computation of pose and shape parameters, so network capacity is wasted if all of them are considered. To cope with this issue, we select a subset of mesh points as landmarks, which mostly represent body shape and complement the joints. We show the 18 selected landmarks and their assigned kinematic tree in Fig. 3. Next, we explain the networks in detail.

Given the joints and landmarks concatenated into a single set of points, and the kinematic tree, we define the vector of normalized relative joints as:


where the fatherhood indices give the parent of each point in the kinematic tree. The reason for this normalization is that we do not need to know relative distances to compute relative joint rotations. Besides, this frees network capacity from unnecessary data variability. Such relative distances can instead be embedded in the computation of shape parameters. Therefore, given the relative distances computed from the template mesh, we define normalized relative distances as:


We also include gender as an extra term to the shape parameters since SMPL originally provides two different models, male and female. By doing so we can estimate gender to select proper SMPL model and differentiate SMPL based on this parameter.
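As a minimal sketch of the normalization of relative joints described above, assume that it amounts to dividing each parent-relative offset by its Euclidean length, so that the pose input keeps only directions and the bone lengths are kept as a separate quantity for shape. That reading, the function name `relative_directions_and_lengths`, the `-1` root marker and the toy three-point chain are all assumptions made for illustration.

```python
import numpy as np

def relative_directions_and_lengths(points, parents, eps=1e-8):
    """Split parent-relative offsets into unit directions and lengths.

    points:  (K, 3) array of joints and landmarks.
    parents: length-K sequence; parents[i] is the parent index of point i,
             or -1 for the root (hypothetical convention).
    Returns directions (K-1, 3) and lengths (K-1,) for the non-root points.
    """
    points = np.asarray(points, dtype=float)
    parents = np.asarray(parents)
    child = np.where(parents >= 0)[0]
    offsets = points[child] - points[parents[child]]
    lengths = np.linalg.norm(offsets, axis=1)
    directions = offsets / np.maximum(lengths, eps)[:, None]
    return directions, lengths

# Toy chain: root -> point 1 -> point 2.
points = np.array([[0.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [3.0, 2.0, 0.0]])
parents = [-1, 0, 1]
directions, lengths = relative_directions_and_lengths(points, parents)
```

Under this assumption the directions are invariant to bone length, which matches the stated motivation that relative rotations do not depend on relative distances.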

SMPLR architecture. We design the network as a deep MLP with 9 sequential fully connected layers with 1024, 1024, 512, 512, 512, 256, 256, 11 and 11 hidden neurons. We add a residual block of 2 fully connected layers with 256 and 11 hidden neurons, branching from the fifth layer and concatenated to the second-to-last layer. This residual block helps to obtain more stable training and more accurate results. All layers except the last are followed by batch normalization and ReLU activation. We assign a tanh activation to the last layer of one of the two networks, while the other has no activation in its last layer. An advantage of SMPLR is that it can be trained with the aid of millions of generated mocap samples. Also, the set of joints and landmarks, as an intermediate image representation, has visual evidence in the image and can be estimated by any suitable CNN directly from the image, while the whole model can be trained end-to-end. In Section 3.3 we explain the CNN applied for this task.
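The layer layout described above can be sketched as a plain NumPy forward pass. This is only an illustration of the wiring under several assumptions: batch normalization is omitted, the residual block uses ReLU after both of its layers, weights are randomly initialised, and the input size of 126 (42 points with 3 coordinates) is a placeholder. The names `init_params` and `forward` are hypothetical.

```python
import numpy as np

HIDDEN = [1024, 1024, 512, 512, 512, 256, 256, 11, 11]  # widths quoted in the text
BRANCH = [256, 11]                                      # residual block widths

def dense(rng, fan_in, fan_out):
    """Random weight matrix and zero bias (initialisation is an assumption)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), (fan_in, fan_out)), np.zeros(fan_out)

def init_params(in_dim, rng):
    main, prev = [], in_dim
    for i, width in enumerate(HIDDEN):
        last = i == len(HIDDEN) - 1
        # The last layer sees the second-to-last output concatenated with the branch output.
        main.append(dense(rng, prev + BRANCH[-1] if last else prev, width))
        prev = width
    branch, prev = [], HIDDEN[4]
    for width in BRANCH:
        branch.append(dense(rng, prev, width))
        prev = width
    return {"main": main, "branch": branch}

def forward(x, params, final_tanh=False):
    relu = lambda v: np.maximum(v, 0.0)
    h, side = x, None
    last = len(params["main"]) - 1
    for i, (w, b) in enumerate(params["main"]):
        if i == last:
            h = np.concatenate([h, side], axis=1) @ w + b  # no ReLU on the output layer
        else:
            h = relu(h @ w + b)                            # batch norm omitted in this sketch
        if i == 4:                                         # fifth layer feeds the residual block
            side = h
            for bw, bb in params["branch"]:
                side = relu(side @ bw + bb)
    return np.tanh(h) if final_tanh else h

rng = np.random.default_rng(0)
params = init_params(in_dim=126, rng=rng)  # 126 is an assumed input size
x = rng.normal(size=(4, 126))
out_linear = forward(x, params)                   # variant without final activation
out_tanh = forward(x, params, final_tanh=True)    # variant with tanh on the last layer
```

The `final_tanh` flag mirrors the statement that one network ends in tanh while the other has no final activation.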

Figure 3: Selected landmarks and assigned kinematic tree in red (color blue shows original kinematic tree).

Independent training of the pose and shape networks is critical before combining them with SMPL, due to the aforementioned many-to-one behavior of SMPL. End-to-end training of SMPLR then improves the pose and shape parameters and, consequently, the SMPL output. The reason is that SMPL provides an explicit correlation of pose and shape components with respect to its outputs. To supervise the shape and pose regression networks, recent works [24, 18] apply a typical regression loss. However, we found that this loss has problems with convergence and generalization in the case of noisy inputs. Instead, we used a different loss for both the pose and shape networks, and it worked well in practice. For supervision of end-to-end training, one can apply a loss on both SMPL outputs, joints and mesh, with respect to their ground truth values. However, we found applying a loss on the whole set of mesh points to be unnecessary, since the points are not distributed equally over the body surface and many of them contribute little to the computation of shape. Therefore, we replace the mesh with its predefined set of landmarks and apply a loss on the joints and landmarks while keeping intermediate supervision on the pose and shape parameters.

3.3 CNN backbone

To apply SMPLR in practice we need an initial estimate of the joints and landmarks to feed to the system. This set of points has visual evidence in the image, and state-of-the-art CNNs are able to predict it accurately. For this task, we select SHN [14] as a standard network for human pose estimation. SHN is a stack of hourglass modules, which are encoder-decoder networks with residual connections between encoder and decoder layers. The outputs of each stack are i) a fusion of feature layers fed into the next stack and ii) a supervised heatmap for each joint. We extend SHN to volumetric heatmaps, following a trend in recent works [26, 19], using 5 stacks. A volumetric heatmap can be defined as an extension of a 2D heatmap to the third dimension. First, we define a fixed-size cube of 2.5 meters along each axis and transform the joints into this cube such that the root joint is located at its center. Then each voxel is defined as a grid box within this reference cube and takes a value from a Gaussian distribution, with variance 1, of the 3D Euclidean distance to the ground truth voxel. Thus, a volumetric heatmap is a tensor whose shape is given by its width, height and depth. We use a fixed tensor shape with a small depth resolution of 16 bins. Although this may introduce an offset error in depth, we found that increasing the depth resolution does not bring a significant improvement while increasing computational complexity exponentially. Volumetric heatmaps are defined and computed for all joints and stacked along a 4th dimension.
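The construction of a volumetric joint heatmap can be sketched as follows. The 2.5 m cube, the root-centered transform, the unit variance and the 16 depth bins come from the text; the 64 x 64 width and height, the mapping of cube coordinates to voxel indices and the names `voxel_coords` and `joint_heatmap` are assumptions made for this illustration.

```python
import numpy as np

def voxel_coords(joint_xyz, root_xyz, shape, cube_size=2.5):
    """Map a joint (meters) to continuous voxel coordinates, root at the cube center."""
    rel = (np.asarray(joint_xyz, float) - np.asarray(root_xyz, float)) / cube_size + 0.5
    return rel * (np.asarray(shape, float) - 1.0)

def joint_heatmap(joint_xyz, root_xyz, shape=(64, 64, 16), cube_size=2.5, sigma=1.0):
    """Gaussian of the voxel-space Euclidean distance to the joint location."""
    center = voxel_coords(joint_xyz, root_xyz, shape, cube_size)
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in shape], indexing="ij"), axis=-1)
    dist2 = ((grid - center) ** 2).sum(axis=-1)
    return np.exp(-dist2 / (2.0 * sigma ** 2))

root = np.array([0.1, 0.9, 3.0])                      # arbitrary root position in meters
heatmap = joint_heatmap(root + np.array([0.3, -0.4, 0.2]), root)
peak = tuple(int(i) for i in np.unravel_index(heatmap.argmax(), heatmap.shape))
```

With only 16 depth bins over 2.5 m, each depth voxel spans roughly 16 cm, which is the kind of quantization the text refers to as an offset error in depth.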

Figure 4: Sample volumetric heatmap of joints (middle) and limbs (right), each limb coded with a different color.

To train the network, we found that cross-entropy with sigmoid activation works better than a regression loss. Although the network trains as expected, the loss function does not capture any correlation among joints, and the network features do not learn such correlations explicitly. A solution is to use multi-task learning across different domains such as semantic body segments, silhouette or 2D/3D joints [26]. However, multi-task learning needs a careful design of information flow and loss balancing among tasks. In this paper, we propose to include volumetric limb heatmaps along with joint heatmaps. By doing so we encourage the network to learn part correlations implicitly. We compute limb heatmaps for the main 8 bones of the body (lower arm, upper arm, lower leg and upper leg, for both left and right sides), along with hands, feet and shoulders. Then we combine them into 4 major parts to avoid extra computation: left arm, right arm, left leg and right leg. To compute limb heatmaps we extend the joint Gaussian along the bone between two joints, using the orthogonal distance of each voxel to the connecting limb segment. Thus, we extend the volumetric heatmap tensor with these additional limb channels. We show a sample volumetric heatmap in Fig. 4. Finally, we apply soft-argmax [30, 10] to compute joint locations from the heatmaps.
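Two of the operations described above lend themselves to a compact sketch: a limb heatmap built from the distance of each voxel to a bone segment, and a soft-argmax that turns a volume into a differentiable coordinate estimate. The version below is a generic NumPy formulation under stated assumptions: soft-argmax is computed as the expected voxel coordinate under a softmax over the volume, coordinates are in voxel units, and the names `soft_argmax_3d` and `limb_heatmap` are hypothetical.

```python
import numpy as np

def soft_argmax_3d(logits, temperature=1.0):
    """Expected voxel coordinate under a softmax over a (W, H, D) volume."""
    flat = logits.reshape(-1) / temperature
    flat = flat - flat.max()                      # numerical stability
    prob = np.exp(flat)
    prob = (prob / prob.sum()).reshape(logits.shape)
    grids = np.meshgrid(*[np.arange(s) for s in logits.shape], indexing="ij")
    return np.array([(prob * g).sum() for g in grids])

def limb_heatmap(a, b, shape, sigma=1.0):
    """Gaussian of the distance from each voxel to the segment between joints a and b."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    grid = np.stack(np.meshgrid(*[np.arange(s) for s in shape], indexing="ij"),
                    axis=-1).astype(float)
    ab = b - a
    # Projection parameter of each voxel onto the segment, clipped to its endpoints.
    t = np.clip(((grid - a) @ ab) / max(ab @ ab, 1e-8), 0.0, 1.0)
    closest = a + t[..., None] * ab
    dist2 = ((grid - closest) ** 2).sum(axis=-1)
    return np.exp(-dist2 / (2.0 * sigma ** 2))

logits = np.zeros((8, 8, 4))
logits[3, 5, 2] = 50.0                            # sharply peaked volume
estimate = soft_argmax_3d(logits)
limb = limb_heatmap((1, 1, 1), (5, 1, 1), (8, 8, 4))
```

Because the soft-argmax is a weighted average rather than a hard index lookup, gradients of a loss on the joint coordinates can flow back into the heatmap values.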

3.4 Denoising autoencoder

Joints estimated by SHN have structured noise. For instance, the error is higher for occluded joints due to their ambiguity and the lack of visual evidence. Even the error of visible joints can be modeled by a Gaussian. Such structured or Gaussian noise can be detected and learned explicitly, helping to further improve the initial estimation fed into the SMPLR module. Denoising autoencoder (DAE) networks [28] are useful tools for such a scenario and have been shown to learn structured patterns of the input better than ordinary autoencoders.

In this paper, we propose a DAE network as a bridge between SHN and SMPLR. With the proposed DAE we are able to denoise 3D inputs or lift 2D estimations to 3D. This procedure can be critical for error-prone CNNs, for instance fast shallow networks. Moreover, the DAE can be detached from the CNN and trained independently, given millions of mocap samples, with adversarial noise generation mimicking structured or Gaussian noise. The architecture of the DAE is shown in Fig. 2. We apply two dropouts, one right after the input and the other after the last encoder layer. By doing so we mimic a two-stack DAE [28] without loss of generalization. By applying skip connections between encoder and decoder layers we force the network to learn the noise structure faster and in a more stable way. The input to the DAE is the initial estimate of joints and landmarks and the output is its denoised version. We apply a loss on this output to train the network. To make the network more aware of the correlations between adjacent joints, we also apply a loss on the quantity defined in Eq. 2.
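One way to picture how a DAE could be trained from mocap data alone is to corrupt clean poses synthetically and ask the network to recover them. The sketch below shows only such a corruption step and is an assumption throughout: the Gaussian noise scale, the larger noise on a chosen set of "occluded" joints and the function name `corrupt_pose` are illustrative and are not the noise model used in the paper. The `drop_depth` option mimics the 2D-to-3D lifting input by setting depth to zero.

```python
import numpy as np

def corrupt_pose(joints_mm, rng, sigma_mm=20.0, occluded=(), occluded_scale=3.0,
                 drop_depth=False):
    """Return a noisy copy of a (K, 3) pose given in millimeters.

    occluded:   indices of joints that receive larger noise (assumed structure).
    drop_depth: if True, zero the depth coordinate to emulate a 2D-only input.
    """
    joints_mm = np.asarray(joints_mm, dtype=float)
    scale = np.full((joints_mm.shape[0], 1), sigma_mm)
    idx = np.asarray(list(occluded), dtype=int)
    scale[idx] *= occluded_scale
    noisy = joints_mm + rng.normal(size=joints_mm.shape) * scale
    if drop_depth:
        noisy[:, 2] = 0.0
    return noisy

rng = np.random.default_rng(0)
clean = rng.normal(size=(24, 3)) * 300.0          # placeholder pose, not real mocap
noisy = corrupt_pose(clean, rng, occluded=[3, 7])
lifting_input = corrupt_pose(clean, rng, drop_depth=True)
```

A training pair would then be (`noisy`, `clean`) for denoising or (`lifting_input`, `clean`) for lifting.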

  Model   Hd    Ts    Sr    Ew    Wt     Hp    Kn    Ft     Avg. Jt  Avg. Lm  Avg. Bn

  SHN     47.5  23.0  44.2  77.2  112.0  16.3  61.9  102.7  62.8     -        10.4
  SHN     46.2  23.0  43.0  75.5  110.2  15.3  61.2  102.2  61.9     61.5     9.3
  SHN     45.0  22.3  41.1  72.7  105.7  14.4  59.6  99.8   57.5     59.3     9.0
  SHN     40.8  20.9  38.0  66.8  93.4   14.3  55.7  92.9   53.0     54.3     9.7

  DAE     45.8  22.2  42.2  75.4  108.5  14.4  61.8  103.2  59.2     61.1     9.5
  DAE     51.4  23.1  46.2  83.7  121.7  15.1  66.6  115.4  65.2     66.1     11.8

  SMPLR   16.3  8.9   13.6  17.7  19.7   5.5   11.6  21.3   14.3     14.3     6.6
  SMPLR   16.5  9.1   13.8  17.3  19.9   5.8   11.7  21.3   14.4     11.5     6.6
  SMPLR   55.1  22.3  48.8  85.8  127.1  13.4  68.6  122.3  67.8     -        -
  SMPLR   50.1  20.1  44.9  83.6  123.6  12.4  63.9  111.9  63.8     -        -

  E2E     53.6  22.7  47.5  85.1  125.6  14.1  65.6  113.7  66.0     68.5     8.9
  E2E     40.5  22.4  38.2  67.1  93.7   15.6  52.1  84.7   51.8     52.9     9.5
Table 1: Ablation study of model components. Errors in mm. Hd: Head, Ts: Torso, Sr: Shoulder, Ew: Elbow, Wt: Wrist, Hp: Hip, Kn: Knee, Ft: Foot, Jt: Joints, Lm: Landmarks, Bn: Bone length. Model variants are distinguished by four options: input without landmarks, training with augmentation, including limb heatmaps in the output, and training along with SMPL loss.

4 Experiments

4.1 Training details

All models and experiments were implemented in TensorFlow and run on a GTX 1080 Ti. We used the Adam optimizer in all trainings, with one learning rate for SHN and another for the DAE, pose and shape estimator networks. SHN was trained to convergence, and the rest of the networks in the ablation analysis were trained with a common batch size. We used a keep probability of 0.8 for the dropout layers in the DAE.

4.2 Datasets

UP-3D [7]. This dataset was built by fitting a gender-neutral SMPL model to images from the LSP, LSP-extended and MPII-HumanPose datasets, keeping the samples with better estimates. This yields a set of labeled in-the-wild images, split into training, validation and test sets. Every sample is provided with 2D annotations and SMPL parameters.

SURREAL [27]. A synthetic dataset of humans generated with the SMPL model, and thus with exact annotations. This is the dataset used for most of our analyses. It is composed of videos of SMPL-generated humans moving on top of random backgrounds, split into training, validation and test samples.

Human3.6M [5]. Human3.6M is the largest dataset offering high-precision 3D data, thanks to MoCap sensors and calibrated cameras. It is composed of RGB videos of different subjects performing different actions twice while being recorded from different viewpoints, amounting to millions of frames. The standard evaluation protocol is to train on subjects S1, S5, S6, S7 and S8, leaving S9 and S11 for validation and S2, S3, S4 and S10 for testing.

4.3 Evaluation protocol

We evaluate the models by mean per joint position error (MPJPE) in millimeters (mm). The same metric is extended to surface points to report the error on the generated body mesh. Following related works we apply two protocols: Protocol 1, where all joints/points are expressed relative to the root joint, and Protocol 2, where estimated joints are aligned with the ground truth through Procrustes analysis.
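The two protocols can be sketched in a few lines of NumPy. This is a generic implementation of MPJPE with root alignment and with a similarity (scale, rotation, translation) Procrustes alignment, not the evaluation code of the paper; the function names, the choice of joint 0 as root and the synthetic test pose are assumptions.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance between corresponding points of two (K, 3) arrays."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def root_align(points, root=0):
    """Protocol 1: express all points relative to the root joint."""
    return points - points[root]

def procrustes_align(pred, gt):
    """Protocol 2: best similarity transform of pred onto gt (least squares)."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    u, s, vt = np.linalg.svd(p.T @ g)
    d = np.sign(np.linalg.det(u @ vt))            # avoid reflections
    fix = np.diag([1.0, 1.0, d])
    rotation = u @ fix @ vt
    scale = (s * np.diag(fix)).sum() / (p ** 2).sum()
    return scale * p @ rotation + mu_g

rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3)) * 100.0             # synthetic pose in mm
c, s_ = np.cos(0.7), np.sin(0.7)
rot_z = np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])
pred = 1.7 * gt @ rot_z + np.array([10.0, -20.0, 30.0])

protocol1_error = mpjpe(root_align(pred), root_align(gt))
protocol2_error = mpjpe(procrustes_align(pred, gt), gt)
```

In this toy case the prediction differs from the ground truth only by a similarity transform, so the Protocol 2 error vanishes while the Protocol 1 error does not, which shows why Protocol 2 numbers are systematically lower.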

4.4 Ablation study

In this section we study the different components of the proposal on the SURREAL validation set. For this task we subsample the training dataset to around 89K frames such that every pair of samples has a minimum per-joint distance of 100mm. The subsample is therefore uniformly distributed over the whole dataset. We use the same setup as in Sec. 4.1 to train each component. We show the results and a description of each component in Tab. 1.
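A distance-based subsampling of this kind can be sketched with a greedy filter. The exact pose distance used for the 100mm criterion is not spelled out here, so the sketch assumes the largest per-joint Euclidean distance between two poses; the function name `greedy_subsample` and the toy poses are also assumptions. The quadratic cost of this naive version would need a spatial index or batching at dataset scale.

```python
import numpy as np

def greedy_subsample(poses, min_dist_mm=100.0):
    """Keep a pose only if it differs from every kept pose by at least min_dist_mm.

    poses: (N, K, 3) array in millimeters.
    The pose distance is the maximum per-joint Euclidean distance (an assumption).
    Returns the list of kept indices, in input order.
    """
    kept = []
    for i, pose in enumerate(poses):
        far_enough = all(
            np.linalg.norm(pose - poses[j], axis=1).max() >= min_dist_mm
            for j in kept
        )
        if far_enough:
            kept.append(i)
    return kept

base = np.zeros((3, 3))
near = base + 10.0                    # every joint moves by about 17mm
far = base.copy()
far[1] += np.array([150.0, 0.0, 0.0])  # one joint moves by 150mm
kept = greedy_subsample(np.stack([base, near, far]))
```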

4.4.1 Stacked Hourglass

Here we analyze the different techniques implemented to enhance the performance of Stacked Hourglass. The results are shown in Tab. 1 under SHN row.

Figure 5: Qualitative images. Rows from top to bottom: SURREAL, UP-3D and Human3.6M.

Default Volumetric SHN is a simple extension of SHN to volumetric heatmaps. We use 16 depth bins in all models. We define a few baseline networks to support the experiments on the next components. The first two are models trained on joints with and without landmarks (18 in total). Comparing them shows a 1mm improvement on average from including landmarks. We also trained with 32 depth bins. However, we found no gain in the results, while complexity increases exponentially. Then we train a model with data augmentation. Besides regular image data augmentation such as random color noise, flipping and rotation, two extra methods are applied. The first is random background replacement: using the binary subject masks provided for each frame, we remove the default background of the samples and replace it with a random image from the Pascal VOC dataset. The second is random occlusion: we place random objects at random positions in the image to artificially create occlusions [19]. Again, the objects come from the Pascal VOC dataset and, in both cases, we do not use images containing humans. Table 1 shows that this augmentation is especially useful for joint prediction, while landmarks, although they improve, do not benefit as much.
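The two extra augmentations reduce to simple mask-based compositing. The sketch below assumes images as (H, W, 3) arrays, binary masks as (H, W) arrays with 1 on the subject or object, and an occluder patch that fits inside the image at the chosen position; the function names `replace_background` and `paste_occluder` are hypothetical, and loading and sampling of Pascal VOC images is left out.

```python
import numpy as np

def replace_background(image, subject_mask, background):
    """Keep subject pixels from image and take all other pixels from background."""
    keep = subject_mask[..., None].astype(bool)
    return np.where(keep, image, background)

def paste_occluder(image, occluder, occluder_mask, top, left):
    """Paste the masked pixels of an occluder patch at (top, left) on a copy of image."""
    out = image.copy()
    h, w = occluder.shape[:2]
    region = out[top:top + h, left:left + w]      # view into the copy
    region[...] = np.where(occluder_mask[..., None].astype(bool), occluder, region)
    return out

image = np.full((2, 2, 3), 200, dtype=np.uint8)
background = np.full((2, 2, 3), 50, dtype=np.uint8)
subject_mask = np.array([[1, 0], [0, 1]], dtype=np.uint8)
composited = replace_background(image, subject_mask, background)

occluder = np.full((1, 1, 3), 9, dtype=np.uint8)
occluded = paste_occluder(composited, occluder, np.array([[1]], dtype=np.uint8), top=1, left=0)
```

In practice the position and the occluder would be drawn at random for each training sample.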

Figure 6: UP-3D results. Based on estimated 2D joints (middle) we compute 3D joints and render mesh (right).

Limb Volumetric Heatmaps. We include additional volumetric heatmaps in the outputs of the network. These heatmaps correspond to limb representations, composed by creating segments from joint to joint. By fitting these heatmaps we expect to push the model to learn spatial relationships among joints and so improve generalization. The results in Tab. 1 show that this strategy does indeed enhance the performance of the network.

4.4.2 Denoising autoencoder

In this section we analyze the DAE on top of the baseline SHN model. The results are shown in Tab. 1 under the DAE row. We train the DAE independently of the baseline with three different inputs: i) uniform noise with constraints for each joint, ii) the baseline output and iii) the baseline output with depth set to 0. With uniform noise, the DAE has an average error of 61.7mm (not shown in the table), similar to the baseline (61.9mm). In contrast, the DAE improves the baseline by about 3mm when the input has structured error coming from the baseline output, with the wrist benefiting the most. The third experiment tests how the DAE performs in the absence of image-level 3D ground truth data. In this case, the average error is slightly above 65mm. Although the average error is 3mm higher than the baseline, this shows that the DAE can successfully lift 2D pose to 3D despite the problem being ill-posed. We note that training with depth-free input converges much more slowly than training with the full 3D input.

4.4.3 SMPLR

In this section we analyze SMPLR on top of the baseline SHN model. The results are shown in Tab. 1 under the SMPLR row. As with the DAE, we train the shape and pose networks independently of the baseline. We report the error on the SMPL output. To analyze the effect of each network on the final results, we train the pose and shape networks independently of each other, supplying ground truth shape when evaluating the pose network and ground truth pose when evaluating the shape network.

One shape model is trained without landmarks in its input. While the average joint error (about 14mm) is practically identical with and without landmarks, the error on the landmarks (11.5 vs 14.3) shows the benefit of landmarks for estimating shape. We then compare pose parameter estimation. One pose model is trained with a loss on pose parameters along with the SMPL loss. Comparing it with the model trained without the SMPL loss shows a 4mm gain from end-to-end training with SMPL. We found that training the shape network with the SMPL loss does not change its results. In general, the larger source of error in SMPLR lies in the pose parameters rather than in the shape. As an extra experiment, we evaluated the accuracy of gender estimation in the shape network and achieved 89.5% accuracy. This level of accuracy is critical for SMPL rendering, since SMPL has a different model for each gender and the choice can have a direct impact on the mesh error.

SMPL-based      Prot. 1  Prot. 2    3D pose         Prot. 1  Prot. 2
Bogo [1]        -        82.3       Tome [25]       88.4     70.7
Lassner [7]     -        80.7       Pavlakos [17]   71.9     51.9
Pavlakos [18]   -        75.9       Zhou [31]       64.9     -
Omran [16]      -        59.9       Martinez [11]   62.9     47.7
Kanazawa [6]    87.9     58.1       Sun [23]        59.1     48.3
SMPLR           71.7     56.4       SHN-final       62.4     50.9
Table 2: MPJPE in mm on Human3.6M for both protocols. Left columns: comparison to SMPL-based methods. Right columns: comparison to state-of-the-art 3D pose methods. SMPLR outperforms all SMPL-based methods, and our simple SHN updates show competitive, near state-of-the-art results.
Method       SMPL surface error (mm)   3D joints error (mm)
Varol [26]   73.6                      46.1
SMPLR        75.4                      55.8
SHN-final    -                         45.8
Table 3: Results on SURREAL dataset (protocol 1).

4.5 End-to-end training

Here, we describe how the end-to-end training was performed. Thanks to the soft-argmax joint location estimation, the model is differentiable and trainable end-to-end. The SMPLR model appears to be very sensitive to noise, which makes training the whole network hard. For that reason, we explore the possibility of using an already trained SMPLR on top of SHN to improve the learning of the latter by back-propagating errors through SMPLR. We first show the results of the SMPL output obtained by stitching the previously trained models into a single model. We found that end-to-end training of this stitched model on top of SHN does not show much improvement. Therefore, we propose the following procedure for end-to-end training. We first train the DAE and SMPLR on ground truth. Once trained, they are frozen and appended to the end of SHN.

The SHN loss is several orders of magnitude smaller than the SMPLR losses; therefore, without proper balancing, the weights of the network vanish after a few training steps. We set the balancing weights empirically. Fine-tuning is performed with a low, empirically chosen learning rate to ensure stable learning. The network shows improvement after the first training epoch, and it fully converges after one additional epoch. To ensure that this improvement is the result of appending the DAE and SMPLR, we trained SHN alone for additional epochs before end-to-end training, without observing any improvement. Looking at Tab. 1 under the E2E row, one can see that this model slightly improves performance, with a larger gain on the legs. Note that the reported error is based on the SHN output after end-to-end training.

4.6 State-of-the-art comparison

Human3.6M. We compare our results to the state of the art on Human3.6M in Tab. 2, split into two sets: SMPL-based solutions vs. 3D pose recovery only. In the former, we outperform the state of the art for both evaluation protocols, especially in protocol 1, where we improve on [6] by over 16 mm. We note that we use a network trained on the SURREAL dataset to estimate shape parameters, since Human3.6M contains just 5 subjects in the training set. The results in the second set show that our simple modifications to SHN achieve competitive, near state-of-the-art results. Compared to [17], a fixed small depth resolution of 16 bins in the volumetric heatmap works better than a coarse-to-fine setup. We show some qualitative results in Fig. 5.
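
To make the 16-bin depth resolution concrete, the sketch below reads a continuous depth out of a 16-bin score vector by taking the expectation under a softmax (soft-argmax). This readout is an assumption made for illustration, since this section does not state how the heatmap is decoded; the depth range `[z_min, z_max]` and the function name are also hypothetical.

```python
import math


def soft_argmax_depth(logits, z_min=-1.0, z_max=1.0):
    """Expected depth under a softmax over equally spaced depth bins."""
    n = len(logits)
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Bin centres spread uniformly over [z_min, z_max].
    step = (z_max - z_min) / n
    centres = [z_min + (i + 0.5) * step for i in range(n)]
    return sum(p * c for p, c in zip(probs, centres))


NUM_BINS = 16
logits = [0.0] * NUM_BINS
logits[11] = 20.0                                 # sharp peak in bin 11
depth = soft_argmax_depth(logits)
```

With a sharp peak the readout is close to the centre of the peaked bin (0.4375 in this normalized range), while flatter score vectors yield values between bin centres, so the output is not limited to 16 discrete depths.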

SURREAL. The only competitor on this dataset is [26]. We show similar performance to this work for SMPL surface and 3D joints error using protocol 1. While the SMPL output is 10 mm worse than SHN due to the offset error embedded in the regressor, our SHN-final slightly improves on the results of [26] for 3D joint estimation. Given the ability to generate images in this dataset based on mocap data, it is possible to train SHN on this dataset and learn the structured error in DAE for in-the-wild scenarios. While this is applicable in our method, [26] needs all ground truth data to train the network. We show some qualitative images in Fig. 5.

UP-3D. We use this dataset to show some qualitative results for the in-the-wild scenario. In the first setup we train all networks with UP-3D ground truth data plus a subset of SURREAL of 18K images. We show some samples from the test set in Fig. 6. The SMPL error on the test set is around 90 mm, which is below the 100.5 mm error reported in [18]. In the second setup we use only 2D data to train SHN. Then, we use previously trained DAE and SMPLR networks to lift 2D joints to 3D and render the body mesh. The results in Fig. 6 show accurate estimates in this in-the-wild scenario.

5 Conclusions

We proposed a deep learning based framework to recover 3D pose and shape from a still RGB image. Our model is composed of an SHN backbone followed by a DAE and a network capable of reversing SMPL from sparse data. Such a model, capable of end-to-end training, is able to accurately reconstruct the body mesh. We experimentally found that processing SHN output joints with the DAE removes structured error. We have shown that the SMPL model can be reversed and used to recover 3D pose and shape. Furthermore, the combination of DAE and SMPLR has proven to ease the learning process. Finally, we exploit SMPLR capabilities in the training of deep learning networks by back-propagating SMPL-related errors through the SHN. We evaluated our proposal on the SURREAL and Human3.6M datasets, outperforming SMPL-based state-of-the-art alternatives.


  • [1] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In European Conference on Computer Vision, pages 561–578. Springer, 2016.
  • [2] E. Brau and H. Jiang. 3d human pose estimation via deep learning from 2d annotations. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 582–591. IEEE, 2016.
  • [3] C.-H. Chen and D. Ramanan. 3d human pose estimation = 2d pose estimation + matching. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5759–5767. IEEE, 2017.
  • [4] V.-T. Hoang and K.-H. Jo. 3d human pose estimation using cascade of multiple neural networks. In IEEE Transactions on Industrial Informatics, 2018.
  • [5] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, jul 2014.
  • [6] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In CVPR, 2018.
  • [7] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler. Unite the people: Closing the loop between 3d and 2d human representations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4704–4713. IEEE, 2017.
  • [8] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015.
  • [9] C. Luo, X. Chu, and A. Yuille. Orinet: A fully convolutional network for 3d human pose estimation. BMVC, 2018.
  • [10] D. C. Luvizon, H. Tabia, and D. Picard. Human pose regression by combining indirect part detection and contextual information. arXiv preprint arXiv:1710.02322, 2017.
  • [11] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3d human pose estimation. In ICCV, 2017.
  • [12] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt. Vnect: Real-time 3d human pose estimation with a single rgb camera. ACM Transactions on Graphics (TOG), 36(4):44, 2017.
  • [13] F. Moreno-Noguer. 3d human pose estimation from a single image via distance matrix regression. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1561–1570. IEEE, 2017.
  • [14] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
  • [15] A. Nibali, Z. He, S. Morgan, and L. Prendergast. 3d human pose estimation with 2d marginal heatmaps. arXiv preprint arXiv:1806.01484, 2018.
  • [16] M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, and B. Schiele. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In 2018 International Conference on 3D Vision (3DV), pages 484–494. IEEE, 2018.
  • [17] G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-to-fine volumetric prediction for single-image 3d human pose. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 1263–1272. IEEE, 2017.
  • [18] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis. Learning to estimate 3d human pose and shape from a single color image. CVPR, 2018.
  • [19] I. Sárándi, T. Linder, K. O. Arras, and B. Leibe. Synthetic occlusion augmentation with volumetric heatmaps for the 2018 eccv posetrack challenge on 3d human pose estimation. In ECCV PoseTrack Workshop, 2018.
  • [20] L. Sigal, A. O. Balan, and M. J. Black. Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International journal of computer vision, 87(1-2):4, 2010.
  • [21] L. Sigal, M. Isard, H. Haussecker, and M. J. Black. Loose-limbed people: Estimating 3d human pose and motion using non-parametric belief propagation. International journal of computer vision, 98(1):15–48, 2012.
  • [22] C. Stoll, N. Hasler, J. Gall, H.-P. Seidel, and C. Theobalt. Fast articulated motion tracking using a sums of gaussians body model. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 951–958. IEEE, 2011.
  • [23] X. Sun, J. Shang, S. Liang, and Y. Wei. Compositional human pose regression. 2017.
  • [24] V. Tan, I. Budvytis, and R. Cipolla. Indirect deep structured learning for 3d human body shape and pose prediction. BMVC, 2017.
  • [25] D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image. CVPR 2017 Proceedings, pages 2500–2509, 2017.
  • [26] G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid. BodyNet: Volumetric inference of 3D human body shapes. In ECCV, 2018.
  • [27] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. In CVPR, 2017.
  • [28] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(Dec):3371–3408, 2010.
  • [29] W. Yang, W. Ouyang, X. Wang, J. Ren, H. Li, and X. Wang. 3d human pose estimation in the wild by adversarial learning. In CVPR, 2018.
  • [30] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. Lift: Learned invariant feature transform. In European Conference on Computer Vision, pages 467–483. Springer, 2016.
  • [31] X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei. Towards 3d human pose estimation in the wild: a weakly-supervised approach. ICCV, 2017.