Code for our CVPR2019 paper: Generating Multiple Hypotheses for 3D Human Pose Estimation with Mixture Density Network
3D human pose estimation from a monocular image or 2D joints is an ill-posed problem because of depth ambiguity and occluded joints. We argue that 3D human pose estimation from a monocular input is an inverse problem where multiple feasible solutions can exist. In this paper, we propose a novel approach to generate multiple feasible hypotheses of the 3D pose from 2D joints.In contrast to existing deep learning approaches which minimize a mean square error based on an unimodal Gaussian distribution, our method is able to generate multiple feasible hypotheses of 3D pose based on a multimodal mixture density networks. Our experiments show that the 3D poses estimated by our approach from an input of 2D joints are consistent in 2D reprojections, which supports our argument that multiple solutions exist for the 2D-to-3D inverse problem. Furthermore, we show state-of-the-art performance on the Human3.6M dataset in both best hypothesis and multi-view settings, and we demonstrate the generalization capacity of our model by testing on the MPII and MPI-INF-3DHP datasets. Our code is available at the project website.READ FULL TEXT VIEW PDF
3D human pose estimation from monocular images is a highly ill-posed pro...
3D human pose estimation from a single image is an inverse problem due t...
We propose a method to generate multiple diverse and valid human pose
Mixture models are well-established machine learning approaches that, in...
Monocular 3D Human Pose Estimation from static images is a challenging
Many prediction tasks contain uncertainty. In some cases, uncertainty is...
In this paper, we introduce an unsupervised feature extraction method th...
Code for our CVPR2019 paper: Generating Multiple Hypotheses for 3D Human Pose Estimation with Mixture Density Network
3D human pose estimation from a single RGB image is an extensively studied problem in computer vision because of many potential useful real world applications such as forensic science, sports analysis and surveillanceetc. Significant progress in 3D human pose estimation has been made with deep learning in the recent years. One of the commonly used and effective deep learning based methods for 3D human pose estimation is the two-stage approach, where the 2D joints are first detected from the image input [18, 24] followed by the 3D joint estimations from the detected 2D joints [1, 29, 4, 15, 10, 6, 25, 17]. The advantage of the two-stage approach is that it decouples the harder problem of 3D depth estimation from the easier 2D pose estimation. In particular, variations in background scene, lighting, clothing shape, skin color etc. are removed before the 3D joint estimation stage. Furthermore, the model can be trained on different domains, e.g. indoor and outdoor, with 2D annotations that are readily available.
Despite the significant progress with deep learning, 3D human pose estimation remains as a very challenging task due to the ambiguity in recovering 3D information from a single RGB image. More specifically, recovering 3D information from a single RGB image or 2D joint locations is an inverse problem  where multiple solutions may exist for the depth of a 3D joint along the light ray that reprojects onto the same 2D joint location, as illustrated in Figure 1. The problem is further aggravated by the non-rigidity of the human pose and joint occlusions on the 2D image. Consequently, there could be many solutions of the 3D pose that satisfy the same 2D pose on an image, even after eliminating the infeasible 3D pose solutions by enforcing various geometric constraints, e.g. joint limits  and bone ratio  etc. In view of the inherent ambiguity of the 3D human pose estimation problem, we argue that it is more reasonable to design a model that generates multiple hypotheses of geometrically feasible 3D human pose that are consistent with the detected 2D joints from a single RGB image. In contrast, the widely adopted single estimate for the inverse problem with inherent ambiguity could lead to overfitting the model to the training data, and might not generalize well. This idea of generating multiple 3D pose hypotheses was first suggested very recently by Jahangiri and Yuille in .
To this end, we introduce the mixture density networks (MDN) [3, 26] to the 3D joint estimation module of the two-stage approach. Contrary to most existing works that generate a single 3D pose by minimizing the negative log-likelihood of an unimodal Gaussian, i.e. a mean squared error, we propose to estimate multiple hypotheses of the 3D pose by minimizing the negative log-likelihood of a multimodal mixture-of-Gaussians. The outputs of our mixture model is a set of mixing coefficients and parameters of the Gaussian kernels, i.e
. means and variances. The set of 3D pose hypotheses are given by the means of the Gaussian kernels, and the mixing coefficient and variances represent the uncertainties of each 3D pose hypothesis. Specifically our network consists of a feature extractor to lift the 2D joints into a feature space, and a hypotheses generator to generate multiple hypotheses. The whole network is a simple network made up of several linear layers with different non-linear activation units.
We show that our network achieves state-of-the-art results on the Human3.6M dataset  in both best hypothesis and multi-view settings. We also report results of our network on the outdoor MPII  dataset and the MPI-INF-3DHP  dataset, where 3D pose labels are not used for training the network. Furthermore, we show the robustness of our network by applying it to scenarios where one or two limb joints are occluded/missing. Our main contributions are as follows:
We explore the idea of generating multiple 3D pose hypotheses to alleviate the ambiguity problem that has not received much attention in the literature.
To the best of our knowledge, we are the first to introduce the mixture density model into 3D human pose estimation, which is more powerful than the single Gaussian distribution.
Our network achieves state-of-the art results on Human3.6M dataset in both best hypothesis and multi-view settings, and in cases where one or two limb joints are occluded/missing.
Existing human 3D pose estimation approaches fall into two categories according to their training techniques. The first category is to train deep convolutional neural networks (CNNs) end-to-end to estimate 3D human poses directly from the input images[20, 16, 28, 19, 27, 14, 23]. Zhou et al. 
use a sparse representation for 3D poses and predict the 3D pose with an expectation-maximization (EM) algorithm. The 2D poses are regarded as a hidden variable in the EM algorithm to remove the need of synchronized 2D-3D data. Parket al.  improve conventional CNNs by concatenating 2D pose estimation as well as information on relative positions with respect to multiple joints. Pavlakos et al.  use volumetric representation to represent 3D poses and adopt the stacked hourglass network , which is originally designed for 2D pose estimation, to predict 3D volumetric heatmaps. Mehta et al. 
use transfer learning to transfer the knowledge learned for 2D pose estimation to the task of 3D pose estimation. Similarly, Zhouet al.  propose a weakly-supervised transfer learning method that uses mixed 2D and 3D labels. The 2D pose estimation sub-network and 3D depth regression sub-network share the same features such that the 3D pose labels for indoor environments can be transferred to in-the-wild images. The direct approach benefits from the rich information contained in images, e.g. the front-back orientation of limbs. However, it will also be affected by a number of factors such as background, lighting, clothing etc. A network trained on one dataset can not be generalized well to the other datasets with different environment, for example from indoor and outdoor environment.
The second category [1, 29, 4, 15, 10, 6, 25, 17] decouples 3D pose estimation into the well-studied 2D joint detection [18, 24] and 3D pose estimation from the detected 2D joints. Akhter et al.  propose a multi-stage approach to estimate the 3D pose from 2D joints using an over-complete dictionary of poses. Bogo et al.  estimate 3D pose by first fitting a statistical body shape model to the 2D joints, and then minimizing the error between the reprojected 3D model and detected 2D joints. Chen  and Yasin  regard 3D pose estimation as a matching between the estimated 2D pose and the 3D pose from a large pose library. Martinez et al.  design a simple fully connected residual network to regress 3D pose from 2D joint detections. The decoupled approach can make use of both indoor and in-the-wild images to train the 2D pose estimators. More importantly, this approach is domain invariant since the input of the second stage is the 2D joints. However, estimating 3D pose from 2D joints is more challenging because 2D pose data contains less information than images, thus there are more ambiguities.
To solve the ill-posed problem of estimating 3D pose from 2D joints, Jahangiri and Yuille 
first proposed to generate multiple diverse pose hypotheses. They first learned a 3D Gaussian mixture model (GMM) model from a uniformly sampled set of Human3.6M poses, and then use conditional sampling to get samples of the 3D poses with reprojected joints errors that are within a threshold. Inspired by their work, we solve the ambiguity problem by generating multiple hypotheses. Instead of using the traditional GMM approach, we introduce the MDN which was first proposed in . The MDN can represent arbitrary conditional distributions by combining a conventional neural network with a mixture density model. Ye et al.  used a hierarchical MDN to solve the occlusion problem in hand pose estimation. Inspired by the work of Ye et al., we use the MDN to solve the depth ambiguity and occlusion problem in 3D human pose estimation.
Figure 2 shows the illustration of our deep network to generate multiple hypotheses for 3D human pose estimation. Our network follows the commonly used two-stage approach that first estimates the 2D joints from the input images followed by the 3D pose estimation from the estimated 2D joints. We adopt the state-of-the-art stacked hourglass  network as the 2D joint estimation module, and use our MDN which consists of a feature extractor and a hypotheses generator to generate the multiple 3D pose hypotheses. Given the 2D joint detections , where is the number of joints in one pose, our goal is to learn a function which maps x into a set of output parameters for our mixture model. , and are the means, variances and mixing coefficients of the mixture model. is the number of Gaussian kernels. The mean of each Guassian kernel represents one 3D pose hypothesis, and the number of Gaussian kennels decides the number of hypotheses generated by our model.
The probability density of the 3D posegiven the 2D joints is represented as a linear combination of Gaussian kernel functions
where is the number of Gaussian kernels, i.e. the number of hypotheses.
is the mixing coefficients, which can be regarded as a prior probability of a 3D pose datay being generated from the Gaussian kernel given the input 2D joints x. Here must satisfy the constraint
is the conditional density of the 3D pose y for the kernel, which can be expressed as a Gaussian distribution
and denote the mean and variance of the kernel, respectively. is the dimension of the output 3D pose y. All the parameters of the mixture model, including the mixing coefficients , the mean and the variance are functions of the input 2D pose x.
Note that the mixture model degenerates to a single Gaussion distribution when the means and variances of all Gaussian kernels are similar, i.e. , for . Hence,
Specifically in our case, the 3D pose hypotheses generated by the MDN will collapse into approximately a single Gaussian when the given 2D pose is simple and less ambiguous, e.g. no occlusions and/or missing joints.
From Eqn. (1), (2) and (3), we can see that all parameters of the Gaussian mixture distribution of y are functional form of x. Hence, we learn this function using a deep network which can be expressed as
where w is the set of learnable weights in the deep network. The probability density in Eqn. (1) can be rewritten to include the learnable weights w of the deep network, i.e.,
The parameters are now dependent on the learnable weights w of the deep network .
We modify the 3D pose estimation module in  to form our deep network . More specifically, our approach is a simple multilayer neural network. Given an input of 2D joints
, we use one linear layer to map the input into an 1024 dimensional feature space, followed by two residual blocks which respectively consists of a linear layer, batch normalization , dropout , and Rectified Linear Units. And there are residual connections between the input and output of each residual block. Different from which adds another linear layer to directly regress the 3D pose from the feature space, our network estimates the parameters
of the mixture model. In particular, we use different activation functions to satisfy the constraints of the three parameters. Specifically, we use a normal linear layer for parameter , a softmax function for the mixture coefficient so that it lies in the range of and sums up to , and a modified ELU function  defined as:
for the variance to keep it positive. Here, is a scale for negative factor.
Given a training dataset with pairs of ground truth labels for the corresponding 2D joints X and 3D poses Y, i.e. , the objective is to find the maximum a posterior of the set of learnable weights w. More formally, assuming that each training data is independent and identically distributed (i.i.d), the posterior distribution of w is given by
is the hyperparameter of the prior over the learnable weightsw. Hence, the optimal weight can be obtained from the minimization of the negative log-posterior
is taken to be the loss function for training our deep network. More specifically,
The prior loss can be further evaluated into:
where the term can be dropped in the loss function since it is independent of w
, and we write the random variablesin its functional form given by the deep network. We further assume a uniform prior over and
, and a Dirichlet conjugate prior over the mixing coefficientsthat follows a Categorical distribution, we get
is the Gamma function, and are the hyperparameters of the Dirichlet distribution, where for . The total loss function to train our deep network is given by
Note that we drop in because it is independent of w.
The term regularizes the mixing coefficients of our mixture model. Setting for implies that we have no prior knowledge over the mixing coefficients. In our experiments, we set , where is a constant scalar value to prevent overfitting of a single Gaussian kernel in the MDN to the training data, i.e. a single mixing coefficient and .
|LinKDE et al. ||132.7||183.6||132.3||164.4||162.1||205.9||150.6||171.3||151.6||243.0||162.1||170.7||177.1||96.6||127.9||162.1|
|Du et al. ||85.1||112.7||104.9||122.1||139.1||135.9||105.9||166.2||117.5||226.9||120.0||117.7||137.4||99.3||106.5||126.5|
|Zhou et al. ||87.4||109.3||87.1||103.2||116.2||143.3||106.9||99.8||124.5||199.2||107.4||118.1||114.2||79.4||97.7||113.0|
|Pavlakos et al. ||67.4||71.9||66.7||69.1||72.0||77.0||65.0||68.3||83.7||96.5||71.7||65.8||74.9||59.1||63.2||71.9|
|Jahangiri et al. ||63.1||55.9||58.1||64.5||68.7||61.3||55.6||86.1||117.6||71.0||71.2||66.3||57.1||62.5||61.0||68.0|
|Zhou et al. ||54.8||60.7||58.2||71.4||62.0||65.5||53.8||55.6||75.2||111.6||64.1||66.0||51.4||63.2||55.3||64.9|
|Martinez et al. ||51.8||56.2||58.1||59.0||69.5||78.4||55.2||58.1||74.0||94.6||62.3||59.1||65.1||49.5||52.4||62.9|
|Lee et al. ||43.8||51.7||48.8||53.1||52.2||74.9||52.7||44.6||56.9||74.3||56.7||66.4||47.5||68.4||45.6||55.8|
|Yasin et al. ||88.4||72.5||108.5||110.2||97.1||142.5||81.6||107.2||119.0||170.8||108.2||86.9||92.1||165.7||102.0||110.1|
|Bogo et al. ||62.0||60.2||67.8||76.5||92.1||77.0||73.0||75.3||100.3||137.3||83.4||77.3||86.8||79.7||87.7||82.3|
|Moreno et al. ||66.1||61.7||84.5||73.7||65.2||67.2||60.9||67.3||103.5||74.6||92.6||69.6||71.5||78.0||73.2||74.0|
|Martinez et al. ||39.5||43.2||46.4||47.0||51.0||56.0||41.4||40.6||56.5||69.4||49.2||45.0||49.5||38.0||43.1||47.7|
|Lee et al. ||37.4||38.9||45.6||43.8||48.5||54.6||39.9||39.2||53.0||68.5||51.5||38.4||33.2||55.8||37.8||45.7|
Our model is implemented in Tensorflow, and we use the ADAM optimizer with an initial learning rate of 0.001 and exponential decay. The batch size is set to 64 and we initialize the weights of linear layers with the Kaiming initialization . The number of Gaussian kernels is set to 5 and the hyperparameters in Eqn. (14
) are set to 2. We train our network for 200 epoches with a dropout rate of 0.5. We also apply max-norm constraint on the weight of each layer so that it is in range. Moreover, we clip the value of to and to to prevent the training loss from becoming NaN. We also use the log-sum-exp trick as previous work  to avoid the underflow problem.
We show numerical results for the Human3.6M dataset  and compare with other state-of-the-art approaches. We also apply our approach to other datasets including MPII  and MPI-INF-3DHP  datasets to test the generalization capacity of our network.
This is currently the largest available video pose dataset, which provides accurate 3D body joint locations recorded by a Vicon motion capture system. There are 15 activity scenarios in total such as “Walking”, “Eating”, “Sitting” and “Discussion”, each action is performed by 7 professional actors. Accurate 2D joint locations , 3D pose annotations and camera parameters are provided. Following 
, we apply standard normalization to the 2D inputs and 3D outputs by subtracting the mean and dividing by the standard deviation of the training data. We also zero-center the 3D poses around the hip joint since we do not predict the global position of the 3D pose.
This is a standard dataset for 2D human pose estimation, which contains 25K unconstrained images collected from YouTube videos. This is the most challenging in-the-wild dataset and we use it to test the generalization of our approach. We report qualitative result for this datset because 3D pose information is not provided.
It is a newly released 3D human pose dataset which is captured by a Mocap system in both indoor and outdoor scenes. We only use the test split of this dataset that includes 2935 frames from six subjects performing seven actions.
We use the state-of-the-art stacked hourglass network  to get the 2D joint detections. The stacked hourglass network is pretrained on the MPII dataset and then fine-tuned on the Human3.6M dataset.
For the Human3.6M dataset, we follow the standard protocol of using S1, S5, S6, S7 and S8 for training, and S9 and S11 for testing. The evaluation metric is the Mean Per Joint Position Error (MPJPE) in millimeter between the ground truth and the estimated 3D pose. Since our network generating multiple hypotheses for each 2D detection, we follow to compute the MPJPE between the ground truth and the best 3D hypothesis generated by our network. The 3D Percentate of Correct Keypoints (3DPCK)  is adopted as the metric for the MPI-INF-3DHP dataset .
|Jahangiri et al. ||108.6||105.9||105.6||109.0||105.5||109.9||102.0||111.3||119.6||107.8||107.1||111.3||108.4||107.0||110.3||108.6|
|Martinez et al. ||57.4||61.6||64.3||65.6||73.3||85.5||61.0||62.1||84.0||101.1||68.2||66.7||70.8||55.6||59.6||69.1|
|Jahangiri et al. ||125.0||121.8||115.1||124.1||116.9||123.8||116.4||119.6||130.8||120.6||118.4||127.1||125.9||121.6||127.6||122.3|
|Martinez et al. ||62.9||66.9||69.9||71.4||80.2||93.8||66.3||65.9||90.6||109.7||74.2||72.1||75.5||61.7||65.7||75.1|
|Studio GS||Studio no GS||Outdoor||All PCK|
|Mehta et al. ||84.1||68.9||59.6||72.5|
|Number of kernels||1||3||5||8|
We first report our results on the Human3.6 dataset and compare with other state-of-the-art approaches. From the results shown in Table 1, we can see that our method outperforms the others in most cases. Our approach achieves an improvement of 5.5% compared to the previous best result 55.8 mm  and 16.2 % compared to the baseline architecture . This indicates the efficiency of our approach by generating multiple hypotheses. Moreover, our network outperforms  which also generates multiple hypotheses by 22.5%. Following previous work, We show our result under Protocol #2 [4, 17] where the estimated pose has been further aligned with the ground truth via a rigid transfermation. The MPJPE error in Table 1 shows that our approach consistently outperforms other approaches.
It is difficult to disambiguate the multiple 3D pose hypotheses generated by our model in a monocular view because most of them are feasible solutions to the inverse 2D-to-3D problem. Hence, we utilize the multi-view images from the set of calibrated cameras provided by the Human3.6M dataset to disambiguate and verify the correctness of the multiple 3D pose hypotheses generated by our network. Specifically, we transform the same pose under different cameras into the global world coordinates, and then we choose the pose which is most consistent with the poses from other camera coordinates. Finally, we get our estimated pose by averaging all poses from different camera coordinates. We list our result in Table 2 and compare with other state-of-the-art approaches based on multi-view  (spatial constraint) or video (temporal constraint) information [14, 10]. Note that it is however not a fair comparison with other results listed in Table 1 because they did not use any multi-view or video information. The results show that our approach has the best performance among both spatial and temporal constraints based methods, indicating the advantage of our approach by generating multiple hypotheses.
In realistic scenarios, it is common that some joints are occluded and cannot be detected. In order to show that our model can handle with missing joints, we ran experiments with different number of missing joints selected randomly from the limb joints including l/r elbow, l/r wrist, l/r knee, l/r ankle. We show our results in Table 3 and compared with the baseline 2D-to-3D estimator  and the GMM based methods  which also focus on generating multiple hypotheses. The baseline outperforms GMM based methods by a large margin, which indicates the advantage of using deep networks. Moreover, our method improves the baseline for all actions with average error decreased by 10.4mm for both cases, further showing the robustness of our method.
We test our method on the MPII and MPI-INF-3DHP datasets to validate the generalization capacity. Note that we train the feature extractor and hypotheses generator on the Human3.6 dataset which contains data from only the indoor environment. The validation set of MPI-INF-3DHP dataset includes images recorded under three different scenes: 1143 images in studio with green screen background (Studio GS), 1064 images in studio without green screen background (Studio no GS) and 728 images in outdoor environment (Outdoor). We use the 2D joints provided by the dataset as input and compute the 3DPCK. The results in Table 4 show that the 3DPCK of our approach is slightly lower than  even though we did not train on their dataset, indicating the generalization of our network. Moreover, our results for different scenes do not change too much compared to the results of . This further suggests the domain-invariant capability of the two-stage approach that we adopted. We only give qualitative results for the MPII dataset in Figure 3 since the ground truth 3D pose data is not provided. We can see that our network can be generalized well to outdoor unseen scenes.
Our hypotheses generator is based on MDN where each of the Gaussian kernels in Eqn.(1) yields different result. We note that our network cannot fit the data completely if is too small, while larger requires more computation resource. We thus train three different models with setting to 3, 5, 8, respectively. We show the average MPJPE on the Human3.6M dataset in Table 5 and compare them with the baseline method, which is based on single Gaussian distribution. The results suggest that our MDN has better performance than single Gaussion based method. Moreover, the performance does not improve much when is larger than five. Consequently, we set to five in our experiments in view of the trade-off between accuracy and computational complexity.
We add a Dirichlet conjugate prior to the distribution of the mixture coefficients ) to prevent overfitting of a single Gaussion kernel to the training data. In order to explore the role of the Dirichlet prior, we compare the performance of our model with and without . The results are shown in Table 6, it can be seen that the performance improves by adding the Dirichlet conjugate prior, especially for the difficult poses in actions “Sitting” and “SittingDown”. The reason is that most of the poses in the Human3.6 dataset are in a standing position, resulting in a worse performance on the “Sitting” and “SittingDown” actions. This further indicates that the Dirichlet conjugate prior can prevent overffiting effectively.
In order to explore the relation between different hypotheses, we reproject all five pose hypotheses into the image plane and compute the difference between projections and the 2D input joints. We adopt the PCKh@0.5 score  which is the standard metric for 2D pose estimation to measure the difference. The high PCKh@0.5 score in Table 7 suggests that all the five hypotheses have almost the same 2D reprojections which are consistent with the 2D input. Note that we do not add any constraint as  did to force all hypotheses to be consistent in the 2D reprojections.
We give several visualization results in Figure 4 to further illustrate the relations between all pose hypotheses. As described by Eqn. (3.1), each Gaussian kernel can be seen to generate the same hypotheses for simple pose with less ambiguity, e.g. standing (first row). This means that single Gaussian distribution is sufficient for simple poses. In comparison, our network can be seen to generate different hypotheses for challenging poses like “GettingDown” or “SittingDown” (second and third rows) due to two reasons. Firstly, our network receives lesser information on this type of poses since most of the poses in the Human3.6 dataset are the “standing” poses. Secondly, there are more ambiguities and occlusions for the “GettingDown” or “SittingDown” poses. As a result, our network generates multiple pose hypotheses to mitigate the increase of the uncertainty. We also visualize the 2D reprojections of all hypotheses in the last column. We indicate the corresponding 2D reprojection and 3D pose with the same color. The overlaps between the 2D reprojections further validate that our network generates hypotheses that are consistent in the 2D image coordinates.
In this work, we introduce the use of a mixture density network to generate multiple feasible hypotheses for the inverse problem of 3D human pose estimation from 2D inputs. Experimental results show that our network achieves state-of-the-art results in both best hypothesis and multi-view settings. Furthermore, the 3D pose hypotheses generated by our network are consistent in the 2D reprojections suggests that the hypotheses model the ambiguity along the depth of the joints. Results on the MPII and MPI-INF-3DHP datasets further show the generalization capacity of our network.
IEEE Conference on Computer Vision and Pattern Recognition, pages 1446–1455, 2015.
Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.In IEEE International Conference on Computer Vision, pages 1026–1034, 2015.