HEMlets PoSh: Learning Part-Centric Heatmap Triplets for 3D Human Pose and Shape Estimation

03/10/2020 ∙ by Kun Zhou, et al. ∙ 0

Estimating 3D human pose from a single image is a challenging task. This work attempts to address the uncertainty of lifting the detected 2D joints to the 3D space by introducing an intermediate state, Part-Centric Heatmap Triplets (HEMlets), which shortens the gap between the 2D observation and the 3D interpretation. The HEMlets utilize three joint-heatmaps to represent the relative depth information of the end-joints for each skeletal body part. In our approach, a Convolutional Network (ConvNet) is first trained to predict HEMlets from the input image, followed by a volumetric joint-heatmap regression. We leverage the integral operation to extract the joint locations from the volumetric heatmaps, guaranteeing end-to-end learning. Despite the simplicity of the network design, the quantitative comparisons show a significant performance improvement over the best-of-grade methods (e.g., 20% on Human3.6M). The proposed method naturally supports training with "in-the-wild" images, where only weakly-annotated relative depth information of skeletal joints is available. This further improves the generalization ability of our model, as validated by qualitative comparisons on outdoor images. Leveraging the strength of the HEMlets pose estimation, we further design and append a shallow yet effective network module to regress the SMPL parameters of the body pose and shape. We term the entire HEMlets-based human pose and shape recovery pipeline HEMlets PoSh. Extensive quantitative and qualitative experiments on the existing human body recovery benchmarks justify the state-of-the-art results obtained with our HEMlets PoSh approach.




1 Introduction

Human pose estimation from a single image is an important problem in computer vision because of its wide applications, e.g., video surveillance and human-computer interaction. Given an image containing a single person, 3D human pose inference aims to predict the 3D coordinates of the human body joints. Recovering 3D information of human poses from a single image faces challenges that are at least threefold: 1) reasoning about 3D human poses from a single image is by itself very challenging due to the inherent ambiguities; 2) for such a regression task, existing approaches have not achieved a good balance between representation efficiency and learning effectiveness; 3) for “in-the-wild” images, both 3D capturing and manual labeling require substantial effort to obtain high-quality 3D annotations, making the training data extremely scarce.

For 2D human pose estimation, almost all best-performing methods are detection based [26, 17, 50]. Detection-based approaches essentially divide the joint localization task into local image classification tasks. The latter is easier to train, because it effectively reduces the feature and target dimensions for the learning system [41]. Existing 3D pose estimation methods often use detection as an intermediate supervision mechanism as well. A straightforward strategy is to use volumetric heatmaps to represent the likelihood map of each 3D joint location [31]. Sun et al. [41] further proposed a differentiable soft-argmax operator that unifies the joint detection task and the regression task into an end-to-end training framework. This significantly improves the state-of-the-art 3D pose estimation accuracy.

Fig. 1: Overview of the HEMlets-based 3D pose estimation. (a) Input RGB image. Our algorithm encodes not only (b) the 2D locations of the two end-joints of each skeletal part, but also (c) their relative depth relationship into HEMlets. (d) Output 3D human pose.

In this work, we propose a novel and effective intermediate representation for 3D pose estimation: Part-Centric Heatmap Triplets (HEMlets), as shown in Fig. 1. The key idea is to polarize the 3D volumetric space around each distinct skeletal part, whose two end-joints are kinematically connected. Different from [30], our relative depth information is represented as three polarized heatmaps, corresponding to the different states of the local depth ordering of the part-centric joint pairs. Intuitively, HEMlets encodes the co-location likelihoods of pairwise joints in a dense per-pixel manner with the coarsest discretization in the depth dimension. Instead of considering arbitrary joint pairs, we focus on kinematically connected ones, as they possess semantic correspondence with the input image and are thus a more effective target for the subsequent learning. In addition, the encoded relative depth information is strictly local to the part-centric joint pairs and suffers less from potentially inconsistent data annotation.

The proposed network architecture is shown in Fig. 3. A ConvNet is first trained to learn the HEMlets and 2D joint heatmaps, which are then fed together with the high-level image features to another ConvNet to produce a volumetric heatmap for each joint. We leverage the soft-argmax regression [41] to obtain the final 3D coordinates of each joint. Significant improvements are achieved compared to the best competing methods, both quantitatively and qualitatively. Most notably, our HEMlets method achieves a record MPJPE of 39.9mm on Human3.6M [14], yielding about a 20% improvement over one best-of-grade method [41].

The merits of the proposed method lie in three aspects:

  • Learning strategy. Our method takes on a progressive learning strategy, and decomposes a challenging 3D learning task into a sequence of easier sub-tasks with mixed intermediate supervisions, i.e., 2D joint detection and HEMlets learning. HEMlets is the key bridging and learnable component leading to 3D heatmaps, and is much easier to train and less prone to over-fitting. Its training can also take advantage of existing labeled datasets of relative depth ordering [30, 38].

  • Representation power. HEMlets is based on 2D per-joint heatmaps, but extends them by a couple of additional heatmaps to encode local depth ordering in a dense per-pixel manner. It builds on top of 2D heatmaps but unleashes the representation power, while still allowing leveraging the soft-argmax regression [41] for end-to-end learning.

  • Simple yet effective. The proposed method features a simple network architecture design, and it is easy to train and implement. It achieves state-of-the-art 3D pose estimation results validated by the evaluations over all standard benchmarks.

A preliminary version of this work on 3D human pose estimation was published in the IEEE/CVF International Conference on Computer Vision (ICCV) 2019 [55]. This paper makes a few major contributions and extensions over the initial conference version as follows. 1) We extend the proposed HEMlets pose framework to further recover the human body model from the given input image. We design a simple body model regression network connected to the preceding HEMlets pose network to recover an SMPL human body mesh from a single color image. 2) We provide more implementation details for our complete network, which also include some related preprocessing during the training phase. 3) We conduct thorough experiments including more ablation studies, as well as quantitative and qualitative evaluations for the recovered human body shape and pose. Extensive experiments justify the state-of-the-art performance of the proposed HEMlets-based pose and shape estimation method (termed HEMlets PoSh) on all mainstream benchmark datasets. In addition, 4) this paper introduces a new weakly-annotated FBI dataset and elaborates its advantages in obtaining weak annotations for the relative depth relationship between a pair of skeletal joints. We also compare the FBI dataset with the recent Ordinal dataset [30].

2 Related Work

In this section, we review the approaches that are based on deep ConvNets for 3D human pose estimation and 3D body model recovery from a single color image.

2.1 3D Body Pose Estimation

We first review the literature on 3D pose estimation from the following four aspects.

Direct Encoder-Decoder.

With the powerful feature extraction capability of deep ConvNets, many approaches [21, 43, 28] learn end-to-end Convolutional Neural Networks (CNNs) to infer human poses directly from the images. Li and Chen [21] were the first to use CNNs to estimate 3D human pose via a multi-task framework. Tekin et al. [43] designed an auto-encoder to model the joint dependencies in a high-dimensional feature space. Park et al. [28] proposed fusing 2D joint locations with high-level image features to boost the estimation of 3D human pose. However, these single-stage methods are limited by the availability of 3D human pose datasets and cannot take advantage of the large-scale 2D pose datasets that are widely available.

Transition with 2D Joints. To avoid collecting 2D-3D paired data, a large number of works [35, 56, 52, 23, 11, 38] decompose the task of 3D pose estimation into two independent stages: 1) first inferring 2D joint locations using well-studied 2D pose estimation methods, such as [56, 35]; 2) and then learning a mapping to lift them into the 3D space. These approaches mainly focus on tackling the second problem. For example, a simple fully connected residual network is proposed by Martinez et al. [23] to directly recover 3D human pose from its 2D projection. Fang et al. [11] considered prior knowledge of human body configurations and proposed a human pose grammar, leading to better recovery of the 3D pose from only 2D joint locations. Yang et al. [52] adopted an adversarial learning scheme to ensure the anthropometrical validity of the output pose and further improved the performance. Recently, by involving a reprojection mechanism, the method proposed in [49] shows insensitivity to overfitting and accurately predicts the result from noisy 2D poses. Though promising results have been achieved by these two-stage methods, a large gap exists between the 3D human pose and its 2D projections due to inherent ambiguities.

3D-Aware Intermediate States. To further bridge the gap between the 2D image and the target 3D human pose under estimation, some recent works [31, 38, 30, 41] proposed to involve 3D-aware states for intermediate supervision. Namely, a network is first trained to map the input image to these 3D-aware states, and then another network is trained to convert those states to the 3D joint locations. Finally, these two networks are combined and optimized jointly. A volumetric representation for 3D joint-heatmaps is proposed in [31], with which the 3D pose is regressed in a coarse-to-fine manner. However, regressing a probability grid in the 3D space globally is also a very challenging task. It usually suffers from quantization errors for the joint locations. To address this issue, Sun et al. [41] exploited a soft-argmax operation and proposed an end-to-end training scheme for the 3D volumetric regression, achieving by far the best performance on 3D pose estimation. Inspired by the observation in [33] that the relative depth ordering across joints is helpful for resolving pose ambiguities, Pavlakos et al. [30] adopted a ranking loss for pairwise ordinal depth to train the 3D human pose predictor explicitly. A similar scheme of relative depth supervision is utilized in the work of [35]. Forward-or-Backward Information (FBI), proposed in [38], is another kind of relative depth information but focuses more on the bone orientations. Recently, Sharma et al. [37] proposed to train a deep conditional variational autoencoder to map 2D poses to 3D poses by learning ordinal maps. In this work, we propose HEMlets, a novel representation that encodes both 2D joint locations and the part-centric relative depth ordering simultaneously. Experiments justify that this representation reaches by far the best balance between representation efficiency and learning effectiveness.

“In-the-Wild” Adaptation. All the aforementioned approaches are mainly trained on datasets collected under indoor settings, due to the difficulty of annotating 3D joints for “in-the-wild” images [5]. Thus, many strategies have been developed to perform domain adaptation. By exploiting graphics techniques, previous works [47, 8] have synthesized a large “faked” dataset mimicking real images. Though these data benefit 3D pose estimation, they are still far from realistic, which limits their applicability. Recently, both Pavlakos et al. [30] and Shi et al. [38] proposed to label the relative depth relationship across joints instead of the exact 3D joint coordinates. This weak annotation scheme not only makes building large-scale “in-the-wild” datasets feasible but also provides 3D-aware information for training the inference model in a weakly-supervised manner. With the HEMlets representation, we can readily use these weakly annotated “in-the-wild” data for domain adaptation.

2.2 3D Body Model Recovery

In recent years, 3D full body models have become popular. They are typically represented with a parametric human body space, such as SMPL [23]. The advantage is that a human body mesh can be easily generated from a set of body shape and pose parameters, so this turns the task of recovering a 3D body model from a single image into a problem of solving for a set of parameters. As a pioneering work in this arena, a two-stage framework is proposed in [4]. It first infers 2D skeleton joints from the input image with a CNN-based model, and then searches for the optimal SMPL parameters to fit the joints with an optimization approach. Due to the depth ambiguity, the second stage tends to converge to a local minimum. To better address this problem, many works [16, 32, 19] proposed to build an end-to-end pipeline to map images into the parametric space using deep regression models.

However, the key challenge in training regression-based models is the lack of paired data, due to the inherent difficulty of annotating or capturing a groundtruth 3D model for a person instance. Existing approaches address this challenge roughly along three main directions. First, some works directly tackled the issue by putting effort into constructing target datasets. The work of [47] first built a synthetic dataset using graphics techniques; however, a model trained only on this dataset is still difficult to apply to real images. Lassner et al. [20] proposed to apply the algorithm of [4] to obtain 3D body models for real images and then manually sift out the reasonable results to build the final human body dataset. Unfortunately, the obtained 3D human shapes are still non-ideal and contain erroneous body part results.

For the second category, several works attempted to directly add extra constraints on the output parameters to ease the training process. For example, the strategy of adversarial learning was utilized in [16], where a discriminator is exploited to constrain the regressed parameters against a reasonable distribution. Given the regressed SMPL models, Kolotouros et al. [18] further adopted the approach of [4] to obtain better parameters to fit the input images. They then directly set an extra objective for the regression model to enforce that its output parameters equal the optimized ones.

Lately, a large body of works proposed to use 2D intermediate representations, such as silhouettes [42, 32] and densepose [1]-based representation [51, 13], which leverage on the idea of self-supervised training. Specifically, other than the main branch of mapping input images to SMPL meshes, these methods also constructed a novel branch to convert the input images to the proposed intermediate representations. A differentiable mesh renderer was then utilized to render the output meshes, which are compared with the learnt intermediate representations. This brings extra supervisions to guide the training process.

More recently, some other 3D representations, such as voxels, meshes and UV-maps, have been used for building generative neural networks to infer 3D models from images. The work of [46], as the first attempt of this kind, proposed an approach to generate a voxel representation of a 3D body from a single image using 3D ConvNets. By taking a template human body mesh as an extra input and treating the mesh as a graph representation, Kolotouros et al. [19] trained a Graph ConvNet to learn the deformation of the template model for fitting both the target pose and shape. The DenseBody algorithm [53] represented the 3D body model with a parameterized UV-map, and then turned the task of geometry inference into a problem of image synthesis. This method further advances body mesh reconstruction accuracy.

In this work, we find that the impact of 3D pose estimation on the accuracy of the final recovered body model is much more significant than the regressed shape parameters. Based on the estimated 3D pose obtained with our HEMlets pose approach, we show that a simple body regression method for SMPL model inference outperforms all the afore-discussed approaches.

Fig. 2: Part-centric heatmap triplets of a skeletal part with a parent joint and a child joint. (a, b) Joints and skeletal parts. We locate the parent joint of each skeletal part at the zero polarity heatmap (c-e). The child joint is located, according to the relative depth of the parent and child joints, in the positive (c), zero (d) and negative polarity heatmap (e), respectively.

Fig. 3: The network architecture of our proposed approach. It consists of four major modules: (a) A ResNet-50 backbone for image feature extraction. (b) A ConvNet for image feature upsampling. (c) Another ConvNet for HEMlets learning and 2D joint detection. (d) A 3D pose regression module adopting a soft-argmax operation for 3D human pose estimation. (e) Details of the HEMlets learning module. “Feature concatenate” denotes concatenating the feature maps from the HEMlets learning branch and the upsampling branch together.

3 HEMlets Pose Estimation

We propose a unified representation of heatmap triplets to model the local information of body skeletal parts, i.e., kinematically connected joints, in which both the corresponding 2D image coordinates and the relative depth ordering are considered. With such a representation, images annotated with the relative depth ordering of skeletal parts can be treated equally with images annotated with 3D joint information. While the latter is usually very scarce, the former is relatively easy to obtain [38, 30]. In this section, we first present the proposed part-centric heatmap triplets and their encoding scheme. Then, we elaborate a simple network architecture that utilizes the part-centric heatmap triplets for 3D human pose estimation.

3.1 Part-Centric Heatmap Triplets

We divide the full body skeleton consisting of $N$ joints into $K$ parts as shown in Fig. 2(a). Specifically, we use $\mathcal{B} = \{B_1, B_2, \dots, B_K\}$ to denote the set of skeletal parts. For each part $B_k$, we denote the two associated joints as $(p, c)$, with $p$ being the parent node and $c$ being the child node. The relative depth ordering, denoted as $r(z_p, z_c)$, can then be described as a tri-state function [30, 38]:

$$ r(z_p, z_c) = \begin{cases} 1, & z_p - z_c > \epsilon \\ 0, & |z_p - z_c| \le \epsilon \\ -1, & z_p - z_c < -\epsilon \end{cases} \qquad (1) $$

where $\epsilon$ is used to adjust the sensitivity of the function to the relative depth difference. The absolute depths of the two joints $p$ and $c$ are denoted by $z_p$ and $z_c$, respectively.

We argue that directly using the discretized label $r(z_p, z_c)$ as an intermediate state for learning the 3D pose from a 2D joint heatmap, as was done in [30, 38], is not as effective, since this abstraction tends to lose some important features encoded in the joints' spatial domain. Instead of elevating the problem straight away to the 3D volumetric space, we utilize an intermediate representation of the 3D-aware relationship of the parent joint $p$ and the child joint $c$ of a skeletal part $B_k$. Provided with the supervision signals, we define polarized target heatmaps where a pair of normalized Gaussian peaks corresponding to the 2D joint locations are placed accordingly across three heatmaps (see Figure 2). We term them the negative polarity heatmap $T_k^{-1}$, the zero polarity heatmap $T_k^{0}$ and the positive polarity heatmap $T_k^{+1}$ with respect to the function value in Eq. (1). The parent joint $p$ is always placed in the zero polarity heatmap $T_k^{0}$. The child joint $c$ will appear in the negative/positive polarity heatmap if its depth is larger/smaller than that of the parent joint $p$ (i.e., $r(z_p, z_c) = -1$ or $+1$, respectively). Both parent and child joints are co-located in the zero polarity heatmap $T_k^{0}$ if their depths are roughly the same (i.e., $r(z_p, z_c) = 0$).

Formally, we denote the heatmap triplets of the skeletal part $B_k$ as the stacking of the three heatmaps:

$$ \mathbf{T}_k = \left[ T_k^{-1},\; T_k^{0},\; T_k^{+1} \right] \qquad (2) $$

Given 3D groundtruth coordinates of all joints, we can readily compute the heatmap triplets of each skeletal part. For easy reference, we shall refer to the part-centric heatmap triplets as HEMlets and use this term hereafter.
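
The encoding rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the unit-peak Gaussian with a fixed `sigma`, the treatment of `z` as distance from the camera, and merging the two peaks in the zero polarity heatmap with a per-pixel maximum are all assumptions made for the example.

```python
import math


def relative_depth_state(z_parent, z_child, eps):
    """Tri-state depth ordering of a parent/child joint pair: +1, 0 or -1."""
    diff = z_parent - z_child
    if diff > eps:
        return 1   # child depth is smaller than the parent depth
    if diff < -eps:
        return -1  # child depth is larger than the parent depth
    return 0       # roughly the same depth


def gaussian_map(height, width, cx, cy, sigma):
    """Unit-peak 2D Gaussian centred at pixel (cx, cy); the scaling is an assumption."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)] for y in range(height)]


def encode_hemlet(parent, child, height, width, eps, sigma=1.0):
    """Build the three polarity heatmaps for one skeletal part.

    parent and child are (x, y, z) tuples with x, y in pixels.
    Returns (state, {-1: map, 0: map, +1: map}).
    """
    state = relative_depth_state(parent[2], child[2], eps)
    triplet = {s: [[0.0] * width for _ in range(height)] for s in (-1, 0, 1)}
    # The parent joint always goes into the zero polarity heatmap.
    triplet[0] = gaussian_map(height, width, parent[0], parent[1], sigma)
    # The child joint goes into the heatmap selected by the depth state.
    child_map = gaussian_map(height, width, child[0], child[1], sigma)
    target = triplet[state]
    for y in range(height):
        for x in range(width):
            target[y][x] = max(target[y][x], child_map[y][x])
    return state, triplet
```

For a child joint that is deeper than its parent, the child peak lands in the negative polarity heatmap while the positive one stays empty; for near-equal depths both peaks share the zero polarity heatmap.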

Discussions. Here we provide some understandings of HEMlets from a few perspectives. First, different from a joint-specific 2D heatmap that models the detection likelihood for each intended joint on the image plane, HEMlets models part-centric pairwise joints’ co-location likelihoods on the image plane simultaneously with their ordinal depth relations. This helps to learn geometric constraints (e.g., bone lengths) implicitly. Second, by augmenting a 2D heatmap to a triplet of heatmaps, HEMlets learns and evaluates the co-location likelihood for a pair of connected joints by their joint probability distribution in a locally-defined volumetric space. In contrast, Pavlakos et al. [30] relaxed the learning target and marginalized the 3D probability distributions independently for the image plane and the depth dimension, with the latter supervised independently by the ordinal depth relation based on a ranking loss. Third, by exploiting the available supervision signals to a larger extent, HEMlets brings the benefit of making the knowledge more explicitly expressed and easier to learn, and bridges the gap in learning the 3D information from a given 2D image.

3.2 3D Pose Inference

Network architecture. We employ a fully convolutional network to predict the 3D human pose as illustrated in Figure 3. A ResNet-50 [12] backbone architecture is adopted for basic feature extraction. One of the two upsampling branches is used to learn the HEMlets and the 2D heatmaps of skeletal joints, and the other one is used to perform upsampling of the learned features to the same resolution as the output heatmaps. Both HEMlets and the 2D joint heatmaps are then encoded jointly by a 2D convolutional operation to form a latent global representation. Finally these global features are joined with the convolutional features extracted from the original image to predict a 3D feature map for each joint. We perform a soft-argmax operation [41] to aggregate information in the 3D feature maps to obtain the 3D joint estimations.

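HEMlets loss. Let us denote with $T^{gt}$ the groundtruth HEMlets of all skeletal parts and with $\hat{T}$ the corresponding prediction. We use a standard $L_2$ distance between $T^{gt}$ and $\hat{T}$ to compute the HEMlets loss as follows:

$$ \mathcal{L}^{HEM} = \left\| \left( T^{gt} - \hat{T} \right) \odot \Lambda \right\|_2^2 \qquad (3) $$

where $\odot$ denotes an element-wise multiplication, and $\Lambda$ is a binary tensor to mask out missing annotations.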

Auxiliary 2D joint loss. As HEMlets essentially contains heatmap responses of 2D joint locations, we adopt a heatmap-based 2D joint detection scheme to facilitate HEMlets prediction. The loss of 2D joint prediction, summed over all $N$ joints, is computed as:

$$ \mathcal{L}^{2D} = \sum_{n=1}^{N} \left\| H_n^{gt} - \hat{H}_n \right\|_2^2 \qquad (4) $$

where $H_n^{gt}$ is the groundtruth 2D heatmap of the $n$-th 2D joint and $\hat{H}_n$ is the corresponding network prediction.

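Soft-argmax 3D joint loss. To avoid quantization errors and allow end-to-end learning, Sun et al. [41] suggested soft-argmax regression for 3D human pose estimation. Given learned volumetric features $F_n$ of size $h \times w \times d$ for the $n$-th joint, the predicted 3D coordinates are given as:

$$ \left[ \hat{x}_n, \hat{y}_n, \hat{z}_n \right] = \sum_{v} v \cdot \mathrm{Softmax}(F_n)(v) \qquad (5) $$

where $v$ denotes a voxel in the volumetric feature space of $F_n$. For robustness, we employ the $L_1$ loss for the regression of 3D joints. Specifically, the loss is defined as:

$$ \mathcal{L}_{\lambda}^{3D} = \sum_{n=1}^{N} \left( \left| x_n^{gt} - \hat{x}_n \right| + \left| y_n^{gt} - \hat{y}_n \right| + \lambda \left| z_n^{gt} - \hat{z}_n \right| \right) \qquad (6) $$

where the groundtruth 3D position of the $n$-th joint is given as $(x_n^{gt}, y_n^{gt}, z_n^{gt})$. We use the same 2D and 3D mixed training strategy as in [41]: $\lambda$ in Eq. (6) is set to $1$ when the training data is from 3D datasets, and to $0$ when the data is from 2D datasets.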
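
To make the soft-argmax step concrete, the sketch below computes the softmax-weighted expected voxel coordinate of a small volume and an L1 joint loss whose depth term can be switched off for 2D-only data. It is a plain-Python illustration under assumptions: the `volume[z][y][x]` layout, the function names and the `depth_weight` argument (standing in for the factor that is set differently for 3D and 2D data) are hypothetical and not taken from the authors' code.

```python
import math


def soft_argmax_3d(volume):
    """Expected (x, y, z) voxel coordinate under a softmax over volume[z][y][x]."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(v for plane in volume for row in plane for v in row)
    total = ex = ey = ez = 0.0
    for z, plane in enumerate(volume):
        for y, row in enumerate(plane):
            for x, value in enumerate(row):
                w = math.exp(value - peak)
                total += w
                ex += w * x
                ey += w * y
                ez += w * z
    return ex / total, ey / total, ez / total


def l1_joint_loss(pred, target, depth_weight):
    """L1 distance on x and y plus a weighted L1 term on depth."""
    return (abs(pred[0] - target[0])
            + abs(pred[1] - target[1])
            + depth_weight * abs(pred[2] - target[2]))
```

Because the output is an expectation rather than the index of the maximum, it is differentiable with respect to the volume and is not restricted to the voxel grid, which is what removes the quantization error mentioned above.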

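Training strategy. For HEMlets prediction, we combine the HEMlets loss $\mathcal{L}^{HEM}$ and the auxiliary 2D joint loss $\mathcal{L}^{2D}$ for the intermediate supervision. The loss function is defined as:

$$ \mathcal{L}^{int} = \mathcal{L}^{HEM} + \mathcal{L}^{2D} \qquad (7) $$

By using $\mathcal{L}^{HEM}$ and $\mathcal{L}^{2D}$ jointly as supervisions, we allow training the network using images with 2D joint annotations and 3D joint annotations. By 3D joint annotation, we refer to annotations with exact 3D joint coordinates or relative depth ordering between part-centric joint pairs.

The end-to-end training loss is defined by combining $\mathcal{L}^{int}$ with the soft-argmax 3D joint loss $\mathcal{L}_{\lambda}^{3D}$:

$$ \mathcal{L}^{tot} = \alpha \, \mathcal{L}^{int} + \mathcal{L}_{\lambda}^{3D} \qquad (8) $$

where the weight $\alpha$ is kept fixed in all our experiments.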
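
As a rough sketch of how the intermediate and end-to-end losses fit together, the snippet below evaluates a masked squared-error loss on flattened heatmaps and then forms the weighted total. Everything here is illustrative: the flat-list representation, the function names, the squared-error form and the value passed for `alpha` are assumptions, not settings reported in the text.

```python
def masked_squared_error(pred, target, mask):
    """Sum of squared differences over entries whose mask value is 1.

    pred, target and mask are flat lists of equal length; entries with
    mask 0 (missing annotations) contribute nothing to the loss.
    """
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))


def total_loss(hemlets_loss, joint2d_loss, joint3d_loss, alpha):
    """Weighted intermediate supervision plus the 3D joint loss."""
    intermediate = hemlets_loss + joint2d_loss
    return alpha * intermediate + joint3d_loss


# Usage with made-up numbers; alpha=0.5 is a placeholder value.
hem = masked_squared_error([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [1, 0, 1])
print(total_loss(hem, 2.0, 1.0, alpha=0.5))
```

The mask is what lets images with partial depth-ordering labels share a batch with fully annotated ones: unlabeled skeletal parts are simply zeroed out of the loss.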

3.3 Implementation Details

Now we present a few implementation details of the proposed method. As different human pose datasets may have different definitions for body joints, we choose to accommodate this difference across supervision sources. The purpose is to take advantage of more human pose annotation sources when using the 2D and 3D mixed training strategy [41]. Figure 4 illustrates the joint structures defined by the Human3.6M [14] and MPII [2] datasets, as well as our joint structure definition. We take the union of these two sets of joint definitions to form an 18-joint set as the regression target. When performance evaluation is conducted on the Human3.6M dataset, only those estimated joints used by Human3.6M are evaluated, as in the work of Martinez et al. [23].

To prepare the human bounding-box input for the proposed network, we crop from the original input image a square-shaped region based on the ground-truth bounding-box, and then resize it proportionally to the network input resolution. To obtain the final metric-scale prediction from the network output (in voxel/pixel space), we resort to the average body bone length learned during the training phase to enable this prediction mapping. We did not use the ground-truth (depth) information during the test phase, e.g., the distance to the root/pelvis, for obtaining the scaling factor.
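
The text does not spell out the exact mapping from voxel/pixel units to metric units, so the following is only one plausible reading: rescale the root-relative prediction so that its mean bone length matches a reference mean bone length measured on the training set. The function names, the choice of joint 0 as the root and the reference value used in the example are all hypothetical.

```python
import math


def mean_bone_length(joints, bones):
    """Average Euclidean length of the bones, given (parent, child) index pairs."""
    return sum(math.dist(joints[p], joints[c]) for p, c in bones) / len(bones)


def to_metric(joints, bones, reference_mean_length):
    """Rescale a root-relative pose so its mean bone length equals the reference.

    joints: list of (x, y, z) in network output units; joint 0 is assumed to be the root.
    reference_mean_length: average bone length (e.g. in mm) from the training data.
    """
    scale = reference_mean_length / mean_bone_length(joints, bones)
    root = joints[0]
    return [tuple(scale * (a - b) for a, b in zip(joint, root)) for joint in joints]


# Usage: a toy three-joint chain with two bones of length 2 voxels each.
pose = [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 2.0, 2.0)]
print(to_metric(pose, [(0, 1), (1, 2)], reference_mean_length=400.0))
```

A scheme of this kind needs no ground-truth depth at test time, which is consistent with the statement above.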

Fig. 4: A unified body joint definition adopted in our method by merging the joints defined by the Human3.6M and MPII datasets.

We implement our method in PyTorch. The model is trained in an end-to-end manner using both images with 3D annotations (e.g., Human3.6M [14] or HumanEva-I [39]) and images with 2D annotations (MPII [2]). In our experiments, we adopt an adaptive value of the sensitivity threshold in Eq. (1) for each skeletal part, determined by the 3D Euclidean distance between the two end joints of that part. The training data is further augmented with rotation, scaling, horizontal flipping and color distortions. The network is trained with Adam optimization until convergence, which took a few days with four NVIDIA GTX 1080 GPUs for the HEMlets pose estimation model.

4 HEMlets Body Model Regression

So far, we have presented the proposed HEMlets pose estimation method in detail. It is natural to ask whether the proposed method can also be extended to recover human body models from the given input color images. To this end, we design and append a shallow yet effective network module to the preceding HEMlets pose network, which leverages the 3D pose estimation accuracy to regress the parameters of the body shape and pose. In this work, we employ the popular 3D body SMPL model [23], where a human body mesh is parameterized by a 3D body shape parameter $\beta$ and a pose parameter $\theta$.

As shown in Fig. 5, the newly added body model regression module is very simple. It takes the predicted 3D joint coordinates from the early stage as input, together with the high-level image features extracted from the given color image. This regression module is trained to regress the SMPL shape and pose parameters as final outputs. It is worth noting that we do not perform explicit human image segmentation, but instead use the high-level image features as implicit cues for shape regression.

Fig. 5: HEMlets-based parametric 3D human body regression from a single color image. We append a shallow yet effective SMPL body mesh regression network to the preceding HEMlets pose estimation network, which is trained end-to-end to regress the SMPL shape and pose parameters.

In our implementation, the additional regression module is trained together with the 3D pose network in an end-to-end manner. Similar to recent works [51, 19], the SMPL pose parameter is converted into 24 rotation matrices for pose regression, which avoids the known singularity problem of the axis-angle representation. Following the similar strategy, the SMPL pose loss is defined as:


where denotes the rotation matrix corresponding to the -th joint. The SMPL shape regression loss is simply computed using the loss as,


Finally, the end-to-end training loss for the parametric 3D human body regression is given by


where is the total loss defined for pose estimation in Eq. (8).
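
The conversion from an axis-angle pose vector to per-joint rotation matrices mentioned above can be illustrated with Rodrigues' formula. This is a generic sketch rather than the authors' code: the function names are hypothetical, the 72-value layout (24 joints times 3 axis-angle components) follows the usual SMPL convention, and the squared-difference comparison of matrices is an assumed form of the pose loss, since the exact norm is not recoverable from the text.

```python
import math


def axis_angle_to_matrix(rx, ry, rz):
    """Rodrigues' formula: axis-angle vector (rx, ry, rz) to a 3x3 rotation matrix."""
    angle = math.sqrt(rx * rx + ry * ry + rz * rz)
    if angle < 1e-8:
        # Near-zero rotation: return the identity to avoid dividing by zero.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = rx / angle, ry / angle, rz / angle
    c, s = math.cos(angle), math.sin(angle)
    v = 1.0 - c
    return [
        [c + kx * kx * v, kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v, ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]


def pose_to_matrices(theta):
    """Split a flat pose vector into triples and convert each to a rotation matrix."""
    return [axis_angle_to_matrix(*theta[i:i + 3]) for i in range(0, len(theta), 3)]


def rotation_loss(pred, target):
    """Sum of squared element-wise differences between two lists of 3x3 matrices."""
    return sum((a - b) ** 2
               for mat_p, mat_t in zip(pred, target)
               for row_p, row_t in zip(mat_p, mat_t)
               for a, b in zip(row_p, row_t))
```

Comparing rotation matrices sidesteps the fact that different axis-angle vectors (for example, angles that differ by a full turn) can describe the same rotation, which would otherwise penalize a correct prediction.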

5 Weakly-Annotated FBI Dataset

Fig. 6: User annotation interface for obtaining the weakly-annotated FBI dataset. An annotator is asked to assign a label of either “Backward”, “Forward” or “Unknown” to a given skeletal part.

In this section, we introduce a new Forward-or-Backward Information (FBI) dataset, and elaborate its advantages in obtaining weak annotations for the relative depth relationship between a pair of skeletal joints. To prepare this FBI dataset, 12K images are randomly drawn from the MPII dataset [2], for which only 2D joint annotations are available. Then, each body part is assigned a label of either “Backward”, “Forward” or “Unknown”. We designed a simple user interface to facilitate the annotation. As shown in Fig. 6, an annotator was presented with one image at a time, with the native 2D skeleton overlaid on the input image. The annotator was asked to assign “Backward” or “Forward” labels only to the subset of body parts about which she/he is confident. The rest of the body parts are assigned the “Unknown” label by default.

5.1 Comparison with The Ordinal Dataset [30]

At first glance, both the FBI and Ordinal [30] annotation schemes aim at annotating the depth ordering between two body joints. However, the FBI scheme simplifies the annotation objective and reduces the annotation complexity for good reason. The Ordinal scheme tries to annotate the relative depth information between every pair of joints. For each image, one question must be answered for every joint pair, so the number of questions grows quadratically with the number of joints. This annotation requirement is not only time-consuming, but also prone to human errors. The FBI scheme, on the other hand, only requires the annotator to answer at most 14 questions for each image. Furthermore, the annotator only needs to tell the relative depth ordering of two kinematically connected joints, which is intuitive and less prone to human errors, as illustrated in Fig. 7. Empirically, we observe that body parts with ambiguous relative depth ordering, namely those with near-equal-depth joints, are difficult to annotate with good accuracy. Therefore, the FBI annotation scheme only asks for “confident” annotations from annotators. Joint pairs can be skipped and retain an “Unknown” label by default.

Fig. 7: A simple illustration of the difference between the FBI and Ordinal annotation schemes. (a) In the Ordinal scheme, the global relative depth ordering between disconnected joints (as in the top-left and bottom-left images) needs to be annotated, which is however challenging to do correctly. (b) In contrast, only the local relative depth ordering between connected joints (as in the right-side images) needs to be annotated in the FBI scheme.

5.2 FBI Annotation Quality and Speed

In order to assess the annotation quality of the FBI scheme, 1000 images with 3D ground truth are randomly selected from the Human3.6M dataset [14]. Then, they are mixed with 12K in-the-wild images for user annotation. We briefed ten first-time annotators about the FBI scheme, and collected their annotations on a total of 13K images. For evaluation, we retrieve the annotations of all the images from the Human3.6M dataset and compare them against the ground-truth relative depth relations. We find that when the ground-truth tilt angle of the skeletal bone with respect to the image plane is sufficiently large, both the percentage of annotation errors and the percentage of skipped annotations are low. However, when this tilt angle is small, both the rates of annotation errors and skipped annotations increase noticeably. This experimental study agrees with our conjecture that the body parts with small tilt angles (hence with ambiguous relative depth ordering) are much harder to annotate.
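
The tilt angle used in this analysis can be computed directly from the 3D ground-truth joints. The sketch below is an illustration under assumptions: it takes the image plane to be the xy-plane of the camera coordinate frame (so the z-axis is the depth direction), and the function name is hypothetical.

```python
import math


def tilt_angle_deg(parent, child):
    """Angle in degrees between a bone and the image (xy) plane.

    parent and child are (x, y, z) joint positions in camera coordinates.
    0 means the bone lies in the image plane; 90 means it points along depth.
    """
    dx = child[0] - parent[0]
    dy = child[1] - parent[1]
    dz = child[2] - parent[2]
    in_plane = math.hypot(dx, dy)
    return math.degrees(math.atan2(abs(dz), in_plane))


# Usage: a bone with equal in-plane and depth extent is tilted by 45 degrees.
print(tilt_angle_deg((0.0, 0.0, 0.0), (1.0, 0.0, 1.0)))
```

Bones with a tilt near zero have end joints at nearly the same depth, which is exactly the case where a forward/backward label is hard to assign.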

Regarding the annotation time, on average each image takes less than 20 seconds to annotate using the FBI scheme, while the Ordinal scheme needs roughly 1 minute per image.

6 Experiments

In this section, we evaluate the proposed HEMlets-based human pose and shape estimation methods by conducting comprehensive experiments over the main benchmark datasets.

| Protocol #1 | Direct | Discuss | Eating | Greet | Phone | Photo | Pose | Purch. | Sitting | SittingD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LinKDE et al. [14] | 132.7 | 183.6 | 132.3 | 164.4 | 162.1 | 205.9 | 150.6 | 171.3 | 151.6 | 243.0 | 162.1 | 170.7 | 177.1 | 96.6 | 127.9 | 162.1 |
| Tome et al. [45] | 65.0 | 73.5 | 76.8 | 86.4 | 86.3 | 110.7 | 68.9 | 74.8 | 110.2 | 173.9 | 85.0 | 85.8 | 86.3 | 71.4 | 73.1 | 88.4 |
| Rogez et al. [34] | 76.2 | 80.2 | 75.8 | 83.3 | 92.2 | 105.7 | 79.0 | 71.7 | 105.9 | 127.1 | 88.0 | 83.7 | 86.6 | 64.9 | 84.0 | 87.7 |
| Tekin et al. [44] | 54.2 | 61.4 | 60.2 | 61.2 | 79.4 | 78.3 | 63.1 | 81.6 | 70.1 | 107.3 | 69.3 | 70.3 | 74.3 | 51.8 | 74.3 | 69.7 |
| Martinez et al. [23] | 53.3 | 60.8 | 62.9 | 62.7 | 86.4 | 82.4 | 57.8 | 58.7 | 81.9 | 99.8 | 69.1 | 63.9 | 67.1 | 50.9 | 54.8 | 67.5 |
| Fang et al. [11] | 50.1 | 54.3 | 57.0 | 57.1 | 66.6 | 73.3 | 53.4 | 55.7 | 72.8 | 88.6 | 60.3 | 57.7 | 62.7 | 47.5 | 50.6 | 60.4 |
| Pavlakos et al. [30] | 48.5 | 54.4 | 54.4 | 52.0 | 59.4 | 65.3 | 49.9 | 52.9 | 65.8 | 71.1 | 56.6 | 52.9 | 60.9 | 44.7 | 47.8 | 56.2 |
| Sárándi et al. [36] | 51.2 | 58.7 | 51.7 | 53.4 | 56.8 | 59.3 | 50.7 | 52.6 | 65.5 | 73.2 | 56.8 | 51.4 | 56.6 | 47.0 | 42.4 | 55.8 |
| Sun et al. [41] | 47.5 | 47.7 | 49.5 | 50.2 | 51.4 | 55.8 | 43.8 | 46.4 | 58.9 | 65.7 | 49.4 | 47.8 | 49.0 | 38.9 | 43.8 | 49.6 |
| Sharma et al. [37] | 48.6 | 54.5 | 54.2 | 55.7 | 62.6 | 72.0 | 50.5 | 54.3 | 70.0 | 78.3 | 58.1 | 55.4 | 61.4 | 45.2 | 49.7 | 58.0 |
| Chen et al. [9] | 41.1 | 44.2 | 44.9 | 45.9 | 46.5 | 39.3 | 41.6 | 54.8 | 73.2 | 46.2 | 48.7 | 42.1 | 35.8 | 46.6 | 38.5 | 46.3 |
| Ours | 34.4 | 42.4 | 36.6 | 42.1 | 38.2 | 39.8 | 34.7 | 40.2 | 45.6 | 60.8 | 39.0 | 42.6 | 42.0 | 29.8 | 31.7 | 39.9 |

| Protocol #2 | Direct | Discuss | Eating | Greet | Phone | Photo | Pose | Purch. | Sitting | SittingD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Nie et al. [27] | 90.1 | 88.2 | 85.7 | 95.6 | 103.9 | 92.4 | 90.4 | 117.9 | 136.4 | 98.5 | 103.0 | 94.4 | 86.0 | 90.6 | 89.5 | 97.5 |
| Chen et al. [7] | 53.3 | 46.8 | 58.6 | 61.2 | 56.0 | 58.1 | 41.4 | 48.9 | 55.6 | 73.4 | 60.3 | 45.0 | 76.1 | 62.2 | 51.1 | 57.5 |
| Martinez et al. [23] | 39.5 | 43.2 | 46.4 | 47.0 | 51.0 | 56.0 | 41.4 | 40.6 | 56.5 | 69.4 | 49.2 | 45.0 | 49.5 | 38.0 | 43.1 | 47.7 |
| Fang et al. [11] | 38.2 | 41.7 | 43.7 | 44.9 | 48.5 | 55.3 | 40.2 | 38.2 | 54.5 | 64.4 | 47.2 | 44.3 | 47.3 | 36.7 | 41.7 | 45.7 |
| Pavlakos et al. [30] | 34.7 | 39.8 | 41.8 | 38.6 | 42.5 | 47.5 | 38.0 | 36.6 | 50.7 | 56.8 | 42.6 | 39.6 | 43.9 | 32.1 | 36.5 | 41.8 |
| Yang et al. [52] | 26.9 | 30.9 | 36.3 | 39.9 | 43.9 | 47.4 | 28.8 | 29.4 | 36.9 | 58.4 | 41.5 | 30.5 | 29.5 | 42.5 | 32.2 | 37.7 |
| Sharma et al. [37] | 35.3 | 35.9 | 45.8 | 42.0 | 40.9 | 52.6 | 36.9 | 35.8 | 43.5 | 51.9 | 44.3 | 38.8 | 45.5 | 29.4 | 34.3 | 40.9 |
| Ours | 29.1 | 34.9 | 29.9 | 32.6 | 31.2 | 32.3 | 27.0 | 33.3 | 37.6 | 45.9 | 32.2 | 31.5 | 34.5 | 22.9 | 25.9 | 32.1 |

| PA MPJPE | Direct | Discuss | Eating | Greet | Phone | Photo | Pose | Purch. | Sitting | SittingD. | Smoke | Wait | WalkD. | Walk | WalkT. | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yasin et al. [54] | 88.4 | 72.5 | 108.5 | 110.2 | 97.1 | 81.6 | 107.2 | 119.0 | 170.8 | 108.2 | 142.5 | 86.9 | 92.1 | 165.7 | 102.0 | 108.3 |
| Sun et al. [41] | 36.9 | 36.2 | 40.6 | 40.4 | 41.9 | 34.9 | 35.7 | 50.1 | 59.4 | 40.4 | 44.9 | 39.0 | 30.8 | 39.8 | 36.7 | 40.6 |
| Dabral et al. [10] | 28.0 | 30.7 | 39.1 | 34.4 | 37.1 | 44.8 | 28.9 | 32.2 | 39.3 | 60.6 | 39.3 | 31.1 | 37.8 | 25.3 | 28.4 | 36.3 |
| Ours | 21.6 | 27.0 | 29.7 | 28.3 | 27.3 | 32.1 | 23.5 | 30.3 | 30.0 | 37.7 | 30.1 | 25.3 | 34.2 | 19.2 | 23.2 | 27.9 |

TABLE I: Quantitative comparisons of the mean per-joint position error (MPJPE) on Human3.6M [14] under Protocol #1 and Protocol #2, as well as using PA MPJPE as the evaluation metric. Similar to most of the competing methods (e.g., [41, 30, 52, 10, 44, 11]), our models were trained on the Human3.6M dataset and also used the extra MPII 2D pose dataset [2].

6.1 3D Human Pose Estimation

We perform quantitative evaluation on three benchmark datasets: Human3.6M [14], HumanEva-I [39] and MPI-INF-3DHP [24]. An ablation study is conducted to evaluate our design choices. We also demonstrate that the proposed method shows superior generalization ability on in-the-wild images.

6.1.1 Datasets and Evaluation Protocols

Human3.6M. Human3.6M [14] contains 3.6 million RGB images captured by a MoCap system in an indoor environment, in which 7 professional actors perform 15 activities such as walking, eating, sitting, making a phone call and engaging in a discussion. We follow the standard protocol as in [23, 31], and use 5 subjects (S1, S5, S6, S7, S8) for training and the remaining 2 subjects (S9, S11) for evaluation (referred to as Protocol #1). Some previous works reported their results with 6 subjects (S1, S5, S6, S7, S8, S9) used for training and only S11 for evaluation [54, 41, 10] (referred to as Protocol #2). Although we do not use S9 as additional training data, we still compare our results with these methods.

HumanEva-I. HumanEva-I [39] is one of the early datasets for evaluating 3D human poses. It contains fewer subjects and actions compared to Human3.6M. Following [3], we train a single model on the training sequences of Subjects 1, 2 and 3, and evaluate on the validation sequences.

MPI-INF-3DHP. This is a recent 3D human pose dataset which includes both indoor and outdoor scenes [24]. Without using its training set, we evaluate the model we trained for Human3.6M on its test set. The results are reported using the 3DPCK and AUC metrics [2, 24, 30].
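For readers unfamiliar with these two metrics, the sketch below shows one common way to compute them from per-joint errors. The 150mm threshold and the 0 to 150mm sweep in 5mm steps are assumptions based on common practice for this benchmark; they are not stated in the text above, and the function names are illustrative.

```python
import math

def joint_errors(pred, gt):
    """Per-joint Euclidean distances between two lists of (x, y, z) points."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]

def pck3d(errors, threshold_mm=150.0):
    """Percentage of joints whose error is within the threshold (assumed 150mm)."""
    return 100.0 * sum(e <= threshold_mm for e in errors) / len(errors)

def auc(errors, max_threshold_mm=150.0, steps=31):
    """Mean 3DPCK over evenly spaced thresholds in [0, max_threshold_mm]."""
    thresholds = [max_threshold_mm * i / (steps - 1) for i in range(steps)]
    return sum(pck3d(errors, t) for t in thresholds) / steps

errors = [0.0, 100.0, 200.0, 300.0]   # toy per-joint errors in mm
print(pck3d(errors))                  # 50.0
```

Because AUC averages 3DPCK over many thresholds, it rewards predictions that are accurate well below the single 3DPCK cut-off.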

Evaluation metric. We follow the standard steps to align the 3D pose prediction with the groundtruth by aligning the position of the central hip joint, and use the Mean Per-Joint Position Error (MPJPE) between the groundtruth and the prediction as the evaluation metric. In some prior works [54, 41, 10], the pose prediction was further aligned with the groundtruth via a rigid transformation. The resulting MPJPE is termed Procrustes Aligned (PA) MPJPE.
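The hip-aligned MPJPE described above can be sketched in a few lines. This is a minimal illustration, assuming poses are lists of (x, y, z) tuples in millimetres and that the central hip joint sits at index 0 (`root_index` is a hypothetical parameter). It does not implement the additional rigid alignment used for PA MPJPE.

```python
import math

def root_align(pose, root_index=0):
    """Translate a pose so that its root (central hip) joint is at the origin."""
    rx, ry, rz = pose[root_index]
    return [(x - rx, y - ry, z - rz) for x, y, z in pose]

def mpjpe(pred, gt, root_index=0):
    """Mean per-joint Euclidean distance after aligning the root joints."""
    pred_aligned = root_align(pred, root_index)
    gt_aligned = root_align(gt, root_index)
    distances = [math.dist(p, g) for p, g in zip(pred_aligned, gt_aligned)]
    return sum(distances) / len(distances)

gt = [(0.0, 0.0, 0.0), (0.0, 100.0, 0.0), (0.0, 200.0, 0.0)]
# Same pose shifted by a global offset, with the last joint 30mm off in x.
pred = [(10.0, 20.0, 30.0), (10.0, 120.0, 30.0), (40.0, 220.0, 30.0)]
print(mpjpe(pred, gt))  # 10.0
```

The global offset between the two poses is removed by the root alignment, so only the 30mm error on one of three joints contributes to the mean.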

6.1.2 Results and Comparisons

Human3.6M. We compare our method against state-of-the-art methods under three protocols, and the quantitative results are reported in Table I. As can be seen, our method outperforms all competing methods on nearly all actions for the protocols used. It is worth mentioning that our method makes considerable improvements on some actions that are challenging for 3D pose estimation, such as Sitting and Walking. Thanks to HEMlets learning, our method demonstrates a clear advantage in handling complicated poses.

With a simple network architecture and little parameter tuning, we produce the most competitive results compared to previous works with carefully designed networks powered by, e.g., adversarial training schemes or prior knowledge. On average, we improve the 3D pose prediction accuracy by about 20% over that reported by Sun et al. [41] under Protocol #1. We also report our performance using PA MPJPE as the evaluation metric, and compare with those methods that make use of S9 as additional training data. We still outperform all of them across all actions, even without utilizing S9 for training.
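As a quick check of the relative improvement over Sun et al. [41], the average Protocol #1 errors from Table I can be plugged into a simple ratio. The helper name below is illustrative.

```python
def relative_improvement(baseline_mm, ours_mm):
    """Relative error reduction, in percent, with respect to the baseline."""
    return 100.0 * (baseline_mm - ours_mm) / baseline_mm

sun_et_al_avg = 49.6   # Table I, Protocol #1, Sun et al. [41]
ours_avg = 39.9        # Table I, Protocol #1, Ours
print(f"{relative_improvement(sun_et_al_avg, ours_avg):.1f}%")  # 19.6%
```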

| Approach | Walking S1 | Walking S2 | Walking S3 | Jogging S1 | Jogging S2 | Jogging S3 | Avg |
|---|---|---|---|---|---|---|---|
| Simo-Serra et al. [40] | 65.1 | 48.6 | 73.5 | 74.2 | 46.6 | 32.2 | 56.7 |
| Moreno-Noguer et al. [25] | 19.7 | 13.0 | 24.9 | 39.7 | 20.0 | 21.0 | 26.9 |
| Martinez et al. [23] | 19.7 | 17.4 | 46.8 | 26.9 | 18.2 | 18.6 | 24.6 |
| Fang et al. [11] | 19.4 | 16.8 | 37.4 | 30.4 | 17.6 | 16.3 | 22.9 |
| Pavlakos et al. [30] | 18.8 | 12.7 | 29.2 | 23.5 | 15.4 | 14.5 | 18.3 |
| Ours | 13.5 | 9.9 | 17.1 | 24.5 | 14.8 | 14.4 | 15.2 |

TABLE II: Detailed results on the validation set of HumanEva-I [39].

HumanEva-I. With the same network architecture, and with only the HumanEva-I dataset used for training, our results are reported in Table II under the popular protocol [40, 25, 23, 11, 30]. Unlike the approaches [30, 25, 23, 11] that used extra 2D datasets (e.g., MPII) or pre-trained 2D detectors (e.g., CPM [50]), our method uses no such extra data and still outperforms previous approaches.

| Approach | Studio GS | Studio no GS | Outdoor | All (3DPCK) | All (AUC) |
|---|---|---|---|---|---|
| Mehta et al. [24] | 70.8 | 62.3 | 58.8 | 64.7 | 31.7 |
| Zhou et al. [56] | 71.1 | 64.7 | 72.7 | 69.2 | 32.5 |
| Pavlakos et al. [30] | 76.5 | 63.1 | 77.5 | 71.9 | 35.3 |
| Ours | 75.6 | 71.3 | 80.3 | 75.3 | 38.0 |

TABLE III: Detailed results on the test set of MPI-INF-3DHP [24]. No training data from this dataset was used to train our model.

MPI-INF-3DHP. We evaluate our method on the MPI-INF-3DHP dataset using two metrics, 3DPCK and AUC. The results are generated by the model we trained for Human3.6M. In Table III, we compare with three recent methods that are not trained on this dataset. Our result on “Studio GS” is about one percentage point lower than that of [30], but our method outperforms all these methods by particularly large margins on the “Outdoor” and “Studio no GS” sequences.

6.1.3 Ablation Study

We study the influence on the final estimation performance of different choices made in our network design and training procedure.

Alternative intermediate supervision. First, we examine the effectiveness of using HEMlets supervision. We evaluate the model trained without any intermediate supervision (Baseline), with 2D heatmap supervision only, with HEMlets supervision only, and with both 2D heatmap supervision and HEMlets supervision (Full). All of these design variants are evaluated with the same experimental setting (including training data, network architecture and loss definition) under Protocol #1 on Human3.6M.

| Method | H3.6M #1 | H3.6M #1 (Human3.6M only) |
|---|---|---|
| Baseline | 47.1 | 55.3 |
| w/ 2D heatmaps | 44.2 | 49.9 |
| w/ HEMlets | 42.6 | 46.0 |
| Full | 39.9 | 45.1 |

TABLE IV: Ablative study on the effects of alternative intermediate supervision evaluated on Human3.6M using Protocol #1. The last column reports the results using only the Human3.6M dataset for training (without using the extra MPII 2D pose dataset).

The detailed results are presented in Table IV. Using 2D heatmap supervision for training, the prediction error is reduced by 3.0mm compared to the baseline. The HEMlets supervision provides a 1.7mm lower mean error compared to the 2D heatmap supervision. This validates the effectiveness of the intermediate supervision. By combining all these choices, our approach using HEMlets with 2D heatmap supervision achieves the lowest error. We repeated this study without using the extra MPII 2D pose dataset, and similar conclusions can still be drawn. However, the gap between w/ HEMlets (46.0mm) and Full (45.1mm) shrinks, suggesting the strength of the HEMlets representation in encoding both 2D and (local) 3D information.

To further illustrate the effectiveness of HEMlets representation, we provide a visual comparison in Fig. 8. Though the 2D joint errors of the two estimations are quite close, the method with HEMlets learning significantly improves the 3D joint estimation result and fixes the gross limb errors.

Fig. 8: An example image with the detected joints overlaid and shown from a novel view, using different methods: (a) (2D error: 15.2; 3D joint error: 81.3mm). (b) (2D error: 13.0; 3D error: 41.2mm). (c) Ground truth. HEMlets learning helps fix local part errors; compare the blue skeletal part in (a) with the red skeletal part in (b).

Regarding the runtime, tested on NVIDIA GTX 1080 GPUs, our full model (with a total parameter number of 47.7M) takes 13.3ms for a single forward inference, while the baseline model (with 34.3M parameters) takes 8.5ms.

Variants of HEMlets. We next experimented with some variants of HEMlets on the Human3.6M and MPII 2D pose datasets. In the first variant, we use five-state heatmaps, referred to as 5s-HEM, where the child joint is placed in different layers of the heatmaps according to the angle of the associated skeletal part with respect to the imaging plane. Specifically, the five states correspond to five consecutive ranges of this angle. In the second variant, we place a pair of joints in the negative and positive polarity heatmaps respectively according to their depth ordering (i.e., the closer/farther joint will appear in the positive/negative polarity heatmap). If their depths are roughly the same, they are co-located in the zero polarity heatmap. We refer to this variant as 2s-HEM. We trained 5s-HEM, 2s-HEM and HEMlets with the Human3.6M dataset only. A comparison of the validation loss is given in Fig. 9.

The other two variants produce inferior convergence compared to HEMlets under the same experiment setting.
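To illustrate the joint placement rule of the second variant (2s-HEM) described above, here is a minimal sketch that builds the three polarity heatmaps for one skeletal part. The Gaussian peak shape, the heatmap size, `sigma` and the depth tolerance `depth_tol` are all hypothetical choices made for illustration; the text does not specify them. Depth is assumed to increase away from the camera.

```python
import math

POLARITIES = ("negative", "zero", "positive")

def add_peak(heatmap, center, sigma):
    """Draw a Gaussian peak at center=(x, y), keeping the per-pixel maximum."""
    cx, cy = center
    for y, row in enumerate(heatmap):
        for x in range(len(row)):
            value = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
            row[x] = max(row[x], value)

def two_state_triplet(joint_a, joint_b, size=16, depth_tol=0.05, sigma=1.5):
    """Place two joints (x, y, z) into negative/zero/positive polarity heatmaps."""
    maps = {name: [[0.0] * size for _ in range(size)] for name in POLARITIES}
    (ax, ay, az), (bx, by, bz) = joint_a, joint_b
    if abs(az - bz) <= depth_tol:
        # Roughly equal depth: both joints are co-located in the zero heatmap.
        add_peak(maps["zero"], (ax, ay), sigma)
        add_peak(maps["zero"], (bx, by), sigma)
    else:
        # Closer joint -> positive polarity, farther joint -> negative polarity.
        closer, farther = ((ax, ay), (bx, by)) if az < bz else ((bx, by), (ax, ay))
        add_peak(maps["positive"], closer, sigma)
        add_peak(maps["negative"], farther, sigma)
    return maps

maps = two_state_triplet((3, 4, 1.0), (10, 12, 2.0))
print(maps["positive"][4][3], maps["negative"][12][10])  # 1.0 1.0
```

The dictionary of three maps mirrors the triplet structure; a real training target would stack one such triplet per skeletal part.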

Fig. 9: The validation loss of 5s-HEM, 2s-HEM and HEMlets, respectively. All are trained with the Human3.6M dataset.

| Dataset | 3DPCK |
|---|---|
| Base | 75.3 |
| w/ Ordinal [30] | 76.1 |
| w/ FBI [38] | 76.9 |
| w/ FBI [38] + Ordinal [30] | |

TABLE V: Evaluation of 3DPCK scores by adding different augmenting datasets that provide relative depth ordering annotations. Base denotes using the base datasets (Human3.6M and MPII).

Augmenting datasets. Many state-of-the-art approaches use a mixed training strategy for 3D human pose estimation. In addition to exploiting Human3.6M and MPII datasets, we study the effect of using augmenting datasets such as Ordinal [30] and FBI [38] for training. Firstly, we adapt the annotations of Ordinal and FBI datasets to the required form of HEMlets. Then we train our model using different combinations of these additional datasets. The comparisons on the MPI-INF-3DHP dataset [24] are reported in Table V. We find augmenting datasets slightly increase the 3DPCK score for the trained model. Interestingly, training with FBI annotations attains a better 3DPCK score than Ordinal annotations. We suspect this is due to the amount of manual annotation errors related to different annotation schemes. In Fig. 10, we also provide some visual examples to compare the effectiveness of different augmenting datasets. One can find that the model fine-tuned with the FBI dataset produces better predictions than the ones trained additionally with Ordinal [30].

Fig. 10: Qualitative results for some examples from MPI-INF-3DHP [24], using different additional datasets. For each example, we present the input RGB image and the 3D human poses predicted by three different models. The groundtruth pose is shown as a dashed line.
Fig. 11: Qualitative results on different validation datasets: the first two columns are from the test dataset of 3DHP [24]. The other columns are from Leeds Sports Pose (LSP) [15]. Our approach produces visually correct results even on challenging poses (last column).

Generalization. To evaluate on in-the-wild images from Leeds Sports Pose (LSP) [15] and the validation set of MPI-INF-3DHP [24], we show some visual results predicted by our approach. As shown in Fig. 11, even for challenging data (e.g., self-occlusion, upside-down poses), our method yields visually correct pose estimations.

6.2 3D Human Body Model Recovery

In this part, we evaluate the proposed human body recovery method, which regresses the SMPL parameters, on three public datasets, i.e., SURREAL [47], UP-3D [20] and 3DPW [48]. Before the experimental studies, we first introduce the datasets and the related evaluation protocols.

6.2.1 Datasets and Evaluation Protocols

SURREAL. SURREAL [47] contains 6M frames from 1,964 video sequences of 115 subjects, where the images are photo-realistic renderings of people under large variations in shape, texture, viewpoint and pose. Because these synthetic bodies are created using SMPL body models, the corresponding model parameters are used as groundtruth for training a human body regression model.

UP-3D. The details of this dataset are presented in [20]. To build it, a large number of real images were collected, and each of them was fitted with an SMPL body model. Next, inaccurate fitting results were identified and discarded manually. Finally, 5,703 training images, 1,423 validation images and 1,389 testing images with fitted SMPL parameters were obtained.

3DPW. Recently, the work of [48] presented a new dataset captured in in-the-wild environments. Specifically, a moving hand-held camera is used for recording RGB frames while IMUs are attached to the actors to capture poses. In total, 60 video sequences (more than 51,000 frames) of 5 subjects are captured, where 7 actors with 18 different clothing styles are asked to perform different activities, such as walking and playing golf.

Evaluation metric. We follow the standard protocols, as detailed in [53], to conduct evaluations. On the SURREAL and UP-3D datasets, the accuracy of the inferred body mesh is measured by the average per-vertex Euclidean distance between it and the groundtruth (referred to as “surface”). We also report the accuracy of the output 3D pose, measured by the average per-joint Euclidean distance between the estimated pose (with the hip joint aligned) and the groundtruth (referred to as “joint”). For the 3DPW dataset, we follow the works of [16, 18] to evaluate the reconstruction error of 3D poses, which is denoted as “Rec. Error”. In addition, following the work of [51], the recovered 3D meshes are also projected onto the 2D image plane to evaluate the accuracy of the mask and part segmentation, for which mIoU and F1 scores are reported.
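For the segmentation-based evaluation, the scores are computed by comparing a projected mask with a reference mask pixel by pixel. The sketch below handles a single binary (foreground versus background) mask pair represented as nested lists of 0/1; how the benchmark averages scores over parts and images is not specified here, so this is only an illustration of the per-mask quantities.

```python
def mask_scores(pred, gt):
    """Pixel accuracy, F1 and IoU for two binary masks of equal shape."""
    tp = fp = fn = tn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1
            elif p and not g:
                fp += 1
            elif g:
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty masks, where F1 and IoU are otherwise undefined.
    f1 = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    return accuracy, f1, iou

pred_mask = [[1, 1], [0, 0]]
gt_mask = [[1, 0], [0, 0]]
accuracy, f1, iou = mask_scores(pred_mask, gt_mask)
print(accuracy, round(f1, 3), iou)  # 0.75 0.667 0.5
```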

| Approach | Human3.6M Pro. #1 | Human3.6M Pro. #2 | SURREAL surface | SURREAL joint | UP-3D surface | UP-3D joint | 3DPW Rec. Error |
|---|---|---|---|---|---|---|---|
| Pavlakos et al. [32] | - | 75.9 | - | - | 117.7 | - | - |
| HMR [16] | 88.0 | 59.1 | - | - | - | - | 81.3 |
| BodyNet [46] | - | - | 73.6 | - | - | - | - |
| SMPLR [22] | 56.5 | 46.3 | 74.5 | 46.1 | - | - | - |
| DenseRaC [51] | 76.8 | - | - | - | - | - | - |
| TexturePose [29] | 51.3 | 49.7 | - | - | - | - | - |
| DenseBody [53] | 47.3 | 38.1 | 54.2 | 40.1 | 91.7 | 71.4 | - |
| SPIN (SPIN*) [18] | - | 41.1 | - | - | - | - | 66.3 (59.2*) |
| Ours | 39.9 | 32.1 | 53.3 | 37.7 | 79.8 | 67.5 | 58.8 |

TABLE VI: Quantitative comparisons of full body model recovery results over different datasets. * denotes the version that also applies the SMPLify optimization [4] as post-processing.

| Approach | FB Seg. Accuracy | FB Seg. F1 | Part Seg. Accuracy | Part Seg. F1 |
|---|---|---|---|---|
| SMPLify oracle [20] | 92.17 | 0.88 | 88.82 | 0.67 |
| SMPLify [4] | 91.89 | 0.88 | 87.71 | 0.64 |
| HMR [16] | 91.67 | 0.87 | 87.12 | 0.60 |
| SPIN [18] | 91.07 | 0.86 | 88.48 | 0.65 |
| SPIN* [18] | 91.83 | 0.87 | 89.41 | 0.68 |
| BodyNet [46] | 92.80 | 0.84 | - | - |
| DenseRaC [51] | 92.40 | 0.88 | 87.9 | 0.64 |
| Ours | 92.30 | 0.88 | 90.18 | 0.71 |
| Ours* | 93.67 | 0.90 | 91.19 | 0.74 |

TABLE VII: Quantitative comparisons between our method and existing ones on foreground and part segmentation of the recovered full body mesh on the UP-3D dataset. * denotes the version that also applies the SMPLify optimization [4] as post-processing.

6.2.2 Results and Comparisons

Next, we report the evaluation results and also compare them with state-of-the-art methods both quantitatively and qualitatively.

Quantitative comparisons. In Table VI, we numerically compare our method to existing leading approaches on the evaluation metrics presented in Sect. 6.2.1. As can be seen, our method produces the best accuracy for both the output skeleton joints and the generated body mesh. Table VII also lists the accuracy of the foreground and the part segmentation, given the generated body mesh. Our proposed method again gives the best performance. It is noteworthy that the part segmentation F1 score of our method evaluated on the UP-3D dataset exceeds 0.70 for the first time.

Qualitative comparisons. We also conduct qualitative comparisons between our method and some existing methods, as shown in Fig. 12. Here, HMR [16] and SPIN [18] are selected as two representative body mesh recovery approaches. Given an input image, the output body mesh of each method is shown in two views. It can be observed that our method performs better than HMR and SPIN, even when the human pose is challenging.

Fig. 12: Qualitative comparisons of our method with some existing ones on human body model recovery. For each example, the input image is shown first, followed by the results of HMR [16], SPIN [18] and ours. For each resulting body mesh, two views are provided for visualization.

| Method | Rec. Error measured on 3DPW |
|---|---|
| Proposed model | 58.8 |
| w/ groundtruth shape | 57.2 |
| w/ groundtruth pose | 9.4 |

TABLE VIII: Evaluation of the impact of learning the pose and shape parameters on human body estimation.
Fig. 13: The results of the proposed approach on multi-person scenarios.

6.2.3 Extended Studies

Fig. 14: Some failure cases of the proposed approach.

We make a few extended studies to further understand the proposed HEMlets-based human pose and shape estimation method.

How does pose accuracy affect body recovery? The accuracy of the output body mesh relies on both the estimated pose and the shape parameters, so an interesting question is which factor matters more. To find out, we run two alternative versions of our full model on the 3DPW dataset: 1) replacing its estimated shape parameter with the groundtruth shape, and 2) replacing the estimated pose parameter with the groundtruth pose. The results are reported in Table VIII. As one can see, the accuracy of pose estimation has a much greater impact. This suggests that 3D pose estimation is critical and contributes more significantly to the task of human body mesh recovery from a single color image.
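The size of the two effects can be read off Table VIII with simple arithmetic. The snippet below only restates the reported numbers; the dictionary keys are illustrative names.

```python
# Rec. Error on 3DPW as reported in Table VIII.
rec_error = {"proposed": 58.8, "gt_shape": 57.2, "gt_pose": 9.4}

def error_drop(variant):
    """Reduction in Rec. Error when one estimated parameter is replaced by groundtruth."""
    return round(rec_error["proposed"] - rec_error[variant], 1)

print(error_drop("gt_shape"))  # 1.6
print(error_drop("gt_pose"))   # 49.4
```

Substituting the groundtruth pose removes most of the error, whereas substituting the groundtruth shape changes it only slightly.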

Multi-person 3D pose and shape. Fig. 13 shows our method can also work well for the multi-person scenarios. To do that, we firstly employ the code of OpenPose [6] to detect person instances. Each instance is then cropped, to which the proposed HEMlets PoSh approach is applied for individual 3D body model inference.
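The cropping step can be sketched as follows. This is a hypothetical helper, not the authors' code and not part of any OpenPose API: it assumes each detected person is given as a list of (x, y, confidence) 2D keypoints, and the `margin` and `min_conf` values are arbitrary illustrative choices.

```python
def person_crop_box(keypoints, image_w, image_h, margin=0.2, min_conf=0.1):
    """Padded square box (left, top, right, bottom) around confident keypoints."""
    points = [(x, y) for x, y, conf in keypoints if conf >= min_conf]
    if not points:
        return None  # nothing reliable to crop around
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Square side covers the larger extent plus a relative margin on each side.
    side = max(max(xs) - min(xs), max(ys) - min(ys)) * (1 + 2 * margin)
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    left = max(0, int(round(cx - side / 2)))
    top = max(0, int(round(cy - side / 2)))
    right = min(image_w, int(round(cx + side / 2)))
    bottom = min(image_h, int(round(cy + side / 2)))
    return left, top, right, bottom

keypoints = [(100, 100, 0.9), (200, 300, 0.8), (0, 0, 0.0)]
print(person_crop_box(keypoints, 640, 480))  # (10, 60, 290, 340)
```

Each returned box would then be cropped from the image and passed to the single-person model independently.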

Failure cases. Our method tends to fail in some complicated scenarios, e.g., poor lighting, severe occlusions and background interference. Some such failure cases are shown in Fig. 14.

More supplementary materials including demo videos are available at the project website: https://sites.google.com/site/hemletspose/.

7 Conclusion

In this paper, we proposed a simple and highly effective HEMlets-based 3D pose estimation method from a single color image. HEMlets is an easy-to-learn intermediate representation encoding the relative forward-or-backward depth relation for each skeletal part’s joints, together with their spatial co-location likelihoods. It proves very helpful in bridging the input 2D image and the output 3D pose in the learning procedure. We demonstrated the effectiveness of the proposed method on the standard benchmarks, yielding a relative accuracy improvement of about 20% over one best-of-grade method [41] on the Human3.6M benchmark. Good generalization ability is also observed for the presented approach. Extending the HEMlets pose estimation network, we further designed a simple parametric 3D human body regression network to estimate the SMPL body shape and pose from the input color image. Extensive experiments have shown that the proposed HEMlets PoSh method clearly outperforms existing methods. For instance, the part segmentation F1 score of our method evaluated on the UP-3D dataset exceeds 0.70 for the first time.

We believe the proposed HEMlets idea is actually general, which may potentially benefit other 3D regression problems e.g., scene depth estimation. Future directions also include an optimized real-time system that detects and tracks multiple persons robustly in total 3D.


This work is supported in part by the National Natural Science Foundation of China (Grant No.: 61771201), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (Grant No.: 2017ZT07X183), the Pearl River Talent Recruitment Program Innovative and Entrepreneurial Teams in 2017 (Grant No.: 2017ZT07X152), the Shenzhen Fundamental Research Fund (Grants No.: KQTD2015033114415450 and ZDSYS201707251409055), and the Department of Science and Technology of Guangdong Province Fund (2018B030338001). The authors would like to thank Yulong Shi and Kaiqi Wang for assisting in some early experiments.


  • [1] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7297–7306, 2018.
  • [2] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3686–3693, 2014.
  • [3] Liefeng Bo and Cristian Sminchisescu. Twin gaussian processes for structured prediction. International Journal of Computer Vision, 87(1-2):28, 2010.
  • [4] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), pages 561–578, 2016.
  • [5] Lubomir Bourdev and Jitendra Malik. Poselets: Body part detectors trained using 3d human pose annotations. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1365–1372, 2009.
  • [6] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7291–7299, 2017.
  • [7] Ching-Hang Chen and Deva Ramanan. 3d human pose estimation = 2d pose estimation + matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7035–7043, 2017.
  • [8] Wenzheng Chen, Huan Wang, Yangyan Li, Hao Su, Zhenhua Wang, Changhe Tu, Dani Lischinski, Daniel Cohen-Or, and Baoquan Chen. Synthesizing training images for boosting human 3d pose estimation. In International Conference on 3D Vision (3DV), pages 479–488, 2016.
  • [9] Xipeng Chen, Kwan-Yee Lin, Wentao Liu, Chen Qian, and Liang Lin. Weakly-supervised discovery of geometry-aware representation for 3d human pose estimation. In CVPR, pages 10895–10904, 2019.
  • [10] Rishabh Dabral, Anurag Mundhada, Uday Kusupati, Safeer Afaque, Abhishek Sharma, and Arjun Jain. Learning 3d human pose from structure and motion. In Proceedings of the European Conference on Computer Vision (ECCV), pages 668–683, 2018.
  • [11] Hao-Shu Fang, Yuanlu Xu, Wenguan Wang, Xiaobai Liu, and Song-Chun Zhu. Learning pose grammar to encode human body configuration for 3d pose estimation. In Association for the Advancement of Artificial Intelligence (AAAI), 2018.
  • [12] Golnaz Ghiasi and Charless C Fowlkes. Occlusion coherence: Localizing occluded faces with a hierarchical deformable part model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2385–2392, 2014.
  • [13] Riza Alp Guler and Iasonas Kokkinos. Holopose: Holistic 3d human reconstruction in-the-wild. In CVPR, pages 10884–10894, 2019.
  • [14] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.
  • [15] Sam Johnson and Mark Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In British Machine Vision Conference (BMVC), 2010.
  • [16] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7122–7131, 2018.
  • [17] Lipeng Ke, Ming-Ching Chang, Honggang Qi, and Siwei Lyu. Multi-scale structure-aware network for human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 713–728, 2018.
  • [18] Nikos Kolotouros, Georgios Pavlakos, Michael J Black, and Kostas Daniilidis. Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In ICCV, 2019.
  • [19] Nikos Kolotouros, Georgios Pavlakos, and Kostas Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4501–4510, 2019.
  • [20] Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J Black, and Peter V Gehler. Unite the people: Closing the loop between 3d and 2d human representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6050–6059, 2017.
  • [21] Sijin Li and Antoni B Chan. 3d human pose estimation from monocular images with deep convolutional neural network. In Asian Conference on Computer Vision (ACCV), pages 332–347, 2014.
  • [22] Meysam Madadi, Hugo Bertiche, and Sergio Escalera. Smplr: Deep smpl reverse for 3d human pose and shape recovery. arXiv preprint arXiv:1812.10766, 2018.
  • [23] Julieta Martinez, Rayat Hossain, Javier Romero, and James J Little. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2640–2649, 2017.
  • [24] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation in the wild using improved CNN supervision. In International Conference on 3D Vision (3DV), pages 506–516, 2017.
  • [25] Francesc Moreno-Noguer. 3d human pose estimation from a single image via distance matrix regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1561–1570, 2017.
  • [26] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 483–499, 2016.
  • [27] Bruce Xiaohan Nie, Ping Wei, and Song-Chun Zhu. Monocular 3d human pose estimation by predicting depth on joints. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3467–3475, 2017.
  • [28] Sungheon Park, Jihye Hwang, and Nojun Kwak. 3d human pose estimation using convolutional neural networks with 2d pose information. In Proceedings of the European Conference on Computer Vision (ECCV), pages 156–169, 2016.
  • [29] Georgios Pavlakos, Nikos Kolotouros, and Kostas Daniilidis. Texturepose: Supervising human mesh estimation with texture consistency. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 803–812, 2019.
  • [30] Georgios Pavlakos, Xiaowei Zhou, and Kostas Daniilidis. Ordinal depth supervision for 3d human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7307–7316, 2018.
  • [31] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G Derpanis, and Kostas Daniilidis. Coarse-to-fine volumetric prediction for single-image 3d human pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1263–1272, 2017.
  • [32] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3d human pose and shape from a single color image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 459–468, 2018.
  • [33] Gerard Pons-Moll, David J Fleet, and Bodo Rosenhahn. Posebits for monocular human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2337–2344, 2014.
  • [34] Grégory Rogez and Cordelia Schmid. Mocap-guided data augmentation for 3d pose estimation in the wild. pages 3108–3116, 2016.
  • [35] Matteo Ruggero Ronchi, Oisin Mac Aodha, Robert Eng, and Pietro Perona. It’s all relative: Monocular 3d human pose estimation from weakly supervised data. In British Machine Vision Conference (BMVC), 2018.
  • [36] István Sárándi, Timm Linder, Kai O Arras, and Bastian Leibe. How robust is 3d human pose estimation to occlusion? In IROS Workshop - Robotic Co-workers 4.0, 2018.
  • [37] Saurabh Sharma, Pavan Teja Varigonda, Prashast Bindal, Abhishek Sharma, and Arjun Jain. Monocular 3d human pose estimation by generation and ordinal ranking. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
  • [38] Yulong Shi, Xiaoguang Han, Nianjuan Jiang, Kun Zhou, Kui Jia, and Jiangbo Lu. FBI-pose: Towards bridging the gap between 2d images and 3d human poses using forward-or-backward information. arXiv preprint arXiv:1806.09241, 2018.
  • [39] Leonid Sigal, Alexandru O Balan, and Michael J Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision, 87(1-2):4, 2010.
  • [40] Edgar Simo-Serra, Ariadna Quattoni, Carme Torras, and Francesc Moreno-Noguer. A joint model for 2d and 3d pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3634–3641, 2013.
  • [41] Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In Proceedings of the European Conference on Computer Vision (ECCV), pages 529–545, 2018.
  • [42] Vince Tan, Ignas Budvytis, and Roberto Cipolla. Indirect deep structured learning for 3d human body shape and pose prediction. 2018.
  • [43] Bugra Tekin, Isinsu Katircioglu, Mathieu Salzmann, Vincent Lepetit, and Pascal Fua. Structured prediction of 3d human pose with deep neural networks. In British Machine Vision Conference (BMVC), 2016.
  • [44] Bugra Tekin, Pablo Márquez-Neila, Mathieu Salzmann, and Pascal Fua. Learning to fuse 2d and 3d image cues for monocular body pose estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3941–3950, 2017.
  • [45] Denis Tome, Chris Russell, and Lourdes Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2500–2509, 2017.
  • [46] Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. Bodynet: Volumetric inference of 3d human body shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 20–36, 2018.
  • [47] Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4627–4635, 2017.
  • [48] Timo von Marcard, Roberto Henschel, Michael J Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3d human pose in the wild using imus and a moving camera. In Proceedings of the European Conference on Computer Vision (ECCV), pages 601–617, 2018.
  • [49] Bastian Wandt and Bodo Rosenhahn. Repnet: Weakly supervised training of an adversarial reprojection network for 3d human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7782–7791, 2019.
  • [50] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4732, 2016.
  • [51] Yuanlu Xu, Song-Chun Zhu, and Tony Tung. Denserac: Joint 3d pose and shape estimation by dense render-and-compare. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 7760–7770, 2019.
  • [52] Wei Yang, Wanli Ouyang, Xiaolong Wang, Jimmy Ren, Hongsheng Li, and Xiaogang Wang. 3d human pose estimation in the wild by adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5255–5264, 2018.
  • [53] Pengfei Yao, Zheng Fang, Fan Wu, Yao Feng, and Jiwei Li. Densebody: Directly regressing dense 3d human pose and shape from a single color image. arXiv preprint arXiv:1903.10153, 2019.
  • [54] Hashim Yasin, Umar Iqbal, Bjorn Kruger, Andreas Weber, and Juergen Gall. A dual-source approach for 3d pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4948–4956, 2016.
  • [55] Kun Zhou, Xiaoguang Han, Nianjuan Jiang, Kui Jia, and Jiangbo Lu. HEMlets pose: Learning part-centric heatmap triplets for accurate 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2344–2353, 2019.
  • [56] Xingyi Zhou, Qixing Huang, Xiao Sun, Xiangyang Xue, and Yichen Wei. Towards 3d human pose estimation in the wild: a weakly-supervised approach. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 398–407, 2017.