1 Introduction
3D human body shape and pose estimation from an RGB image is a challenging computer vision problem, partly due to its under-constrained nature wherein multiple 3D human bodies may explain a given 2D image, especially when the subject is significantly occluded, as is common for in-the-wild images. Several recent works
[55, 19, 26, 25, 47, 65, 12, 38, 14, 41, 40, 56, 36, 53] use deep neural networks to regress a single body shape and pose solution, which can result in impressive 3D body reconstructions given sufficient visual evidence in the input image. However, when visual evidence of the subject's shape and pose is obscured, e.g. due to occluding objects or self-occlusions, a single solution does not fully describe the space of plausible 3D reconstructions. In contrast, we aim to estimate a structured probability distribution over 3D body shape and pose, conditioned on the input image, thereby allowing us to sample any number of plausible 3D reconstructions and quantify prediction uncertainty over the 3D body surface, as shown in Figure 1.

We use the SMPL body model [33] to represent human shape and pose. Identity-dependent body shape is parameterised by coefficients of a PCA basis; hence, a simple multivariate Gaussian distribution over the shape parameters is suitable. Body pose is parameterised by relative 3D joint rotations along the SMPL kinematic tree, which may be represented using rotation matrices. Regressing rotation matrices using neural networks is non-trivial, since they lie in SO(3), a non-linear 3D manifold with a different topology to R^3 or R^9, the spaces in which unconstrained neural network outputs lie. However, one can define probability density functions over the Lie group SO(3), such as the matrix-Fisher distribution [34, 11, 21], the parameter of which is an element of R^{3x3} and may be easily regressed with a neural network [35]. We propose a hierarchical probability distribution over relative 3D joint rotations along the SMPL kinematic tree, wherein the probability density function of each joint's relative rotation matrix is a matrix-Fisher distribution conditioned on the parents of that joint in the kinematic tree. We train a deep neural network to predict the parameters of such a distribution over body pose, alongside a Gaussian distribution over SMPL shape.

Moreover, to ensure that 3D bodies sampled from the predicted distributions match the 2D input image, we implement a reprojection loss between predicted samples and ground-truth visible 2D joint annotations. To allow for the backpropagation of gradients through the sampling operation, we present a differentiable rejection sampler for matrix-Fisher distributions over relative 3D joint rotations.
Finally, a key obstacle for SMPL body shape regression from in-the-wild images is the lack of training datasets with accurate and diverse body shape labels [47]. To overcome this, we follow [47, 53, 41, 48] and utilise synthetic data, randomly generated on-the-fly during training. Inspired by [7], we use convolutional edge filters to close the large synthetic-to-real gap and show that using edge-based inputs yields better performance than commonly-used silhouette-based inputs [47, 53, 48, 41], due to improved robustness and capacity to retain visual shape information.
In summary, our main contributions are as follows:

Given an input image, we predict a novel hierarchical matrix-Fisher distribution over relative 3D joint rotation matrices, whose structure is explicitly informed by the SMPL kinematic tree, alongside a Gaussian distribution over SMPL shape parameters.

We present a differentiable rejection sampler to sample any number of plausible 3D reconstructions and quantify prediction uncertainty over the body surface. This enables a reprojection loss between predicted samples and ground-truth coordinates of visible 2D joints, further ensuring that the predicted distributions are consistent with the input image.
2 Related Work
This section reviews approaches to monocular 3D human body shape and pose estimation, as well as deep-learning-based methods for probabilistic rotation estimation.
Monocular 3D shape and pose estimation
methods can be classified as optimisation-based or learning-based. Optimisation-based approaches fit a parametric 3D body model [33, 1, 39, 18] to 2D observations, such as 2D keypoints [5, 29], silhouettes [29] or body part segmentations [63], by optimising a suitable cost function. These methods do not require expensive 3D-labelled training data, but are sensitive to poor initialisations and noisy observations.

Learning-based approaches can be further split into model-free or model-based. Model-free methods use deep networks to directly output human body vertex meshes [26, 36, 65, 64, 8], voxel grids [56] or implicit surfaces [45, 46] from an input image. In contrast, model-based methods [19, 47, 38, 12, 55, 14, 41, 40, 61] regress 3D body model parameters [39, 33, 18, 1], which give a low-dimensional representation of a 3D human body. To overcome the lack of in-the-wild 3D-labelled training data, several methods [19, 61, 26, 12, 14] use diverse 2D-labelled data as a source of weak supervision. [25] extends this approach by incorporating optimisation into their model training loop, lifting 2D labels to self-improving 3D labels. These approaches often result in impressive 3D pose predictions, but struggle to accurately predict a diverse range of body shapes, since 2D keypoint supervision only provides a sparse shape signal. Shape prediction accuracy may be improved using synthetic training data [47, 53, 41, 48] consisting of synthetic input proxy representations (PRs) paired with ground-truth body shape and pose. PRs commonly consist of silhouettes and 2D joint heatmaps [47, 41, 48], necessitating accurate silhouette segmentations [24, 15] at test-time, which is not guaranteed for challenging in-the-wild inputs. Other methods [56] pre-train on synthetic RGB inputs [57] and then fine-tune on the scarce and limited-shape-diversity real 3D training data available [16, 58], to avoid overfitting to artefacts in low-fidelity synthetic data. In contrast, we utilise edge-based PRs, hence dropping the reliance on accurate segmentation networks without requiring fine-tuning on real data or high-fidelity synthetic data.
3D human shape and pose distribution estimation. Early optimisation-based 3D pose estimators [50, 51, 52, 9, 10] specified a cost function corresponding to the posterior probability of 3D pose given 2D observations and analysed its multi-modal structure due to ill-posedness. Strategies to sample multiple 3D poses with high posterior probability included cost-covariance-scaled [50] and inverse-kinematics-based [52] global search and local refinement, as well as cost-function-modifying MCMC [51]. Recently, several learning-based methods [49, 31, 17, 59, 37] predict multi-modal distributions over 3D joint locations conditioned on 2D inputs, using Bayesian mixtures of experts [49], mixture density networks [31, 4, 37] or normalising flows [59, 44]. Our method extends beyond 3D joints and predicts distributions over human pose and shape. This has been addressed by Biggs et al. [3], who predict a categorical distribution over a set of SMPL [33] parameter hypotheses. Sengupta et al. [48] estimate an independent Gaussian distribution over both SMPL shape and joint rotation vectors. In contrast, we note that 3D rotations lie in SO(3), motivating our hierarchical matrix-Fisher distribution.

Rotation distribution estimation via deep learning. Prokudin et al. [42] use biternion networks to predict a mixture-of-von-Mises distribution over object pose angle. Gilitschenski et al. [13] use a Bingham distribution over unit quaternions to represent orientation uncertainty. However, these works have to enforce constraints on the parameters of their predicted distributions (e.g. positive semi-definiteness). To overcome this, Mohlin et al. [35] train a deep network to regress a matrix-Fisher distribution [34, 11, 21] over 3D rotation matrices. We adapt this approach to define our hierarchical matrix-Fisher distribution over relative 3D joint rotation matrices.
3 Method
This section provides an overview of SMPL [33] and the matrix-Fisher distribution [11, 21, 34], presents our structured, hierarchical pose and shape distribution estimation architecture, and discusses the loss functions used to train it.
3.1 SMPL model
SMPL [33] is a parametric 3D human body model. Identity-dependent body shape is represented by shape parameters beta ∈ R^10, which are coefficients of a PCA body shape basis. Body pose is defined by the relative 3D rotations of the bones formed by the 23 body (i.e. non-root) joints in the SMPL kinematic tree. The rotations may be represented using rotation matrices {R_i}_{i=1}^{23}, where R_i ∈ SO(3). We parameterise the global rotation (i.e. rotation of the root joint) in axis-angle form by gamma ∈ R^3. A differentiable function maps the input pose and shape parameters to an output vertex mesh V ∈ R^{6890x3}. 3D joint locations, for L joints of interest, are obtained as J = JV, where J ∈ R^{Lx6890} is a linear vertex-to-joint regression matrix.
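The final step above, the linear vertex-to-joint regression, can be sketched in a few lines of numpy; the mesh and regressor values below are random placeholders, not the real SMPL data:

```python
import numpy as np

# Toy stand-in for the linear vertex-to-joint regression described above:
# 3D joints are a fixed linear combination of mesh vertices. The random
# regressor below is a placeholder, NOT the real SMPL regression matrix.
rng = np.random.default_rng(0)
num_vertices, num_joints = 6890, 24
vertices = rng.standard_normal((num_vertices, 3))    # posed mesh V
J_reg = rng.random((num_joints, num_vertices))
J_reg /= J_reg.sum(axis=1, keepdims=True)            # rows sum to 1
joints3d = J_reg @ vertices                          # (num_joints, 3)
print(joints3d.shape)  # (24, 3)
```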
3.2 Matrix-Fisher distribution over SO(3)
The 3D special orthogonal group may be defined as SO(3) = {R ∈ R^{3x3} | R^T R = I, det R = 1}. The matrix-Fisher distribution [11, 21, 34] defines a probability density function over SO(3), given by

p(R; F) = (1 / c(F)) exp(tr(F^T R))     (1)

where F ∈ R^{3x3} is the matrix parameter of the distribution, c(F) is the normalising constant and R ∈ SO(3). We present some key properties of the matrix-Fisher distribution below, but refer the reader to [30, 35] for further details, visualisations and a method for approximating the intractable normalising constant and its gradient w.r.t. F.
The properties of p(R; F) can be described in terms of the singular value decomposition (SVD) of F, denoted by F = U Sigma V^T, with Sigma = diag(sigma_1, sigma_2, sigma_3). U and V are orthonormal matrices, but they may have a determinant of -1 and thus are not necessarily elements of SO(3). Therefore, a proper SVD [30] is used, F = U' Sigma' V'^T, where

U' = U diag(1, 1, det U),   Sigma' = diag(sigma_1, sigma_2, det(UV) sigma_3),   V' = V diag(1, 1, det V),     (2)

which ensures that U', V' ∈ SO(3). Then, the mode of the distribution is given by [30]

R_mode = U' V'^T.     (3)

The columns of U' define the distribution's principal axes of rotation (analogous to the principal axes of a multivariate Gaussian distribution), while the proper singular values in Sigma' = diag(sigma'_1, sigma'_2, sigma'_3) give the concentration of the distribution for rotations about the principal axes [30]. Specifically, the concentration for rotations about the i-th principal axis (i-th column of U') is given by sigma'_j + sigma'_k for j, k ≠ i. The concentration of the distribution may be different about each principal axis, allowing for axis-dependent rotation uncertainty modelling.

3.3 Proxy representation computation
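The proper SVD (Equation 2) and the distribution mode (Equation 3) are straightforward to compute; a minimal numpy sketch:

```python
import numpy as np

def proper_svd(F):
    """Proper SVD of the matrix-Fisher parameter F (Equation 2):
    sign flips ensure that U' and V' lie in SO(3)."""
    U, s, Vt = np.linalg.svd(F)
    V = Vt.T
    dU, dV = np.linalg.det(U), np.linalg.det(V)
    U_p = U @ np.diag([1.0, 1.0, dU])
    V_p = V @ np.diag([1.0, 1.0, dV])
    s_p = np.array([s[0], s[1], s[2] * dU * dV])  # proper singular values
    return U_p, s_p, V_p

def fisher_mode(F):
    """Mode of the matrix-Fisher distribution (Equation 3): U' V'^T."""
    U_p, _, V_p = proper_svd(F)
    return U_p @ V_p.T
```

Since det U' = det V' = 1 by construction, the mode U' V'^T is guaranteed to be a valid rotation matrix, and U' Sigma' V'^T still reconstructs F exactly.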
Given an input RGB image I, we first compute a proxy representation X (see Figure 2), consisting of an edge-image concatenated with joint heatmaps. Comparisons with silhouette- and RGB-based representations are given in Section 5.1. Edge-images are obtained with Canny edge detection [6]. 2D joint heatmaps are computed using HRNet-W48 [54], and joint predictions with low confidence scores are thresholded out. The edge-image and joint heatmaps are stacked along the channel dimension to produce X. Proxy representations [47, 41] are used to close the domain gap between synthetic training images and real test-time RGB images, since synthetic proxy representations are more similar to their real counterparts than synthetic RGB images are to real RGB images.
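As a rough illustration of the proxy-representation stacking, here is a numpy sketch; a simple gradient-magnitude edge map stands in for Canny edge detection, and the confidence threshold is an illustrative value, not the one used in the paper:

```python
import numpy as np

def make_proxy(rgb, joints2d, conf, size=64, sigma=2.0, conf_thresh=0.5):
    """Sketch of proxy-representation computation: one edge-image channel
    plus one Gaussian heatmap channel per 2D joint. Low-confidence joint
    detections are zeroed out (conf_thresh is an illustrative value)."""
    gray = rgb.mean(axis=-1)
    gy, gx = np.gradient(gray)                 # gradient-magnitude edges,
    edges = np.sqrt(gx ** 2 + gy ** 2)         # a stand-in for Canny [6]
    edges = edges / (edges.max() + 1e-8)
    yy, xx = np.mgrid[0:size, 0:size]
    heatmaps = []
    for (x, y), c in zip(joints2d, conf):
        h = np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
        heatmaps.append(h if c >= conf_thresh else np.zeros((size, size)))
    # Stack edge-image and heatmaps along the channel dimension
    return np.concatenate([edges[None], np.stack(heatmaps)], axis=0)
```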
3.4 Body shape and pose distribution prediction
Our goal is to predict a probability distribution over relative 3D joint rotations {R_i}_{i=1}^{23} and SMPL shape parameters beta, conditioned upon a given input proxy representation X. We also predict deterministic estimates of the global body rotation gamma and weak-perspective camera parameters c, representing scale and translation.
Since beta represents the linear coefficients of a PCA shape-space, a Gaussian distribution with a diagonal covariance matrix is suitable [48],

p(beta | X) = N(beta; mu(X), diag(sigma^2(X)))     (4)

where the mean mu(X) and variances sigma^2(X) are functions of X.

The matrix-Fisher distribution (Equation 1) may be naively used to define a distribution over 3D joint rotations

p(R_i | X) = M(R_i; F_i(X))     (5)

for i = 1, ..., 23. Here, each joint is modelled independently of all the other joints. Thus, the matrix parameter of the i-th joint, F_i, is a function of the input X only.
To predict the parameters of this naive, independent distribution over 3D joint rotations, in addition to the shape distribution parameters, global body rotation and weak-perspective camera, we learn a function mapping the input X to the set of desired outputs {F_i(X)}_{i=1}^{23}, mu(X), sigma^2(X), gamma(X) and c(X), where the function is represented by a deep neural network.
However, the independent matrix-Fisher distribution in Equation 5 does not model SMPL 3D joint rotations faithfully, since the rotation of each part/bone is defined relative to its parent joint in the SMPL kinematic tree. Hence, a distribution over the i-th rotation matrix R_i conditioned on the input should be informed by the distributions over all its parent joints, as well as the global body rotation gamma, to enable the distribution to match the 2D visual pose evidence present in X. Furthermore, 3D joints in the SMPL rest-pose skeleton are dependent upon the shape parameters beta, while the mapping from 3D to the 2D image plane is given by the camera model. Hence, a distribution over R_i given X should also consider the predicted shape mean mu(X) and variance sigma^2(X), as well as the predicted camera c(X). This is similar to the rationale behind the deterministic iterative/hierarchical predictors in [19, 12], except we model these relationships in a probabilistic sense, by defining

p(R_i | X) = M(R_i; F_i)     (6)

for i = 1, ..., 23, where the matrix parameter of the i-th joint, F_i, is now a function of all its parent distributions, represented by their principal axes U'_j, singular values Sigma'_j and modes U'_j V'_j^T for each parent joint j, as well as the shape distribution parameters mu(X) and sigma^2(X), global rotation gamma(X), camera parameters c(X) and the input X. Note that the parent distributions are themselves functions of their respective parent joints, while mu, sigma^2, gamma and c are all functions of X.
To predict the parameters of the hierarchical matrix-Fisher distribution in Equation 6, we propose a hierarchical neural network architecture (Figure 2). When considered as a black box, it yields the same set of outputs as the independent architecture. However, it utilises the iterative hierarchical architecture presented in Figure 2, which amounts to multiple streams of fully-connected layers, each following one "limb" of the kinematic tree. In contrast, the independent architecture predicts pose similarly to shape, camera and global rotation parameters, using a single stream of fully-connected layers. We compare the naive independent formulation with the hierarchical formulation in Section 5.1.
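A toy sketch of the hierarchical prediction order, in numpy, with tiny random MLPs standing in for the fully-connected streams; for brevity each joint here is conditioned only on its immediate parent's matrix parameter, whereas the paper conditions on all ancestors plus shape, camera and global rotation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, out_dim=9, hidden=32):
    # Tiny two-layer MLP with random (untrained, illustrative) weights
    W1 = rng.standard_normal((in_dim, hidden))
    W2 = rng.standard_normal((hidden, out_dim))
    return lambda x: np.maximum(x @ W1, 0) @ W2

def hierarchical_pose_params(feats, parents, heads):
    """Predict each joint's unconstrained 3x3 matrix parameter F_i from
    image features plus its parent's already-predicted F (simplified
    version of the hierarchical conditioning in Equation 6)."""
    Fs = [None] * len(parents)
    for i, p in enumerate(parents):
        inp = feats if p < 0 else np.concatenate([feats, Fs[p].ravel()])
        Fs[i] = heads[i](inp).reshape(3, 3)
    return Fs

feat_dim, parents = 16, [-1, 0, 1]   # a 3-joint chain for illustration
heads = [mlp(feat_dim + (9 if p >= 0 else 0)) for p in parents]
Fs = hierarchical_pose_params(rng.standard_normal(feat_dim), parents, heads)
print(len(Fs), Fs[0].shape)  # 3 (3, 3)
```

The key design point is the prediction order: a joint's parameters can only be computed after its parent's, mirroring one stream of fully-connected layers per kinematic "limb".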
3.5 Loss functions
Distribution prediction networks are trained with a synthetic dataset (Section 4).
Negative log-likelihood (NLL) loss on distribution parameters. The NLL corresponding to the Gaussian body shape distribution (Equation 4) is given by:

L_shape-NLL = sum_i [ (1/2) log sigma_i^2(X) + (beta_i^gt - mu_i(X))^2 / (2 sigma_i^2(X)) ] + const.     (7)
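Equation 7 amounts to the standard diagonal-Gaussian NLL; a minimal numpy sketch (dropping the additive constant):

```python
import numpy as np

def gaussian_shape_nll(beta_gt, mu, var):
    """Diagonal-Gaussian NLL (Equation 7), omitting the constant term:
    sum over shape dimensions of 0.5*log(var) + 0.5*(beta - mu)^2 / var."""
    return 0.5 * np.sum(np.log(var) + (beta_gt - mu) ** 2 / var)
```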
The NLL corresponding to the matrix-Fisher distribution over relative 3D joint rotations is defined as [35]:

L_pose-NLL = sum_{i=1}^{23} [ log c(F_i) - tr(F_i^T R_i^gt) ]     (8)

where F_i may be obtained via the independent or hierarchical matrix-Fisher models presented above. Intuitively, the trace term pushes the predicted distribution mode (Equation 3) towards the target R_i^gt, while the log normalising constant acts as a regulariser, preventing the singular values of F_i from getting too large [35]. All predicted distribution parameters are dependent on the network weights, which are learnt in a maximum likelihood framework aiming to minimise the joint shape and pose NLL: L_NLL = L_shape-NLL + L_pose-NLL.
Loss on global body rotation. We predict deterministic estimates of the global body rotation vectors gamma, which are supervised using ground-truth global rotations gamma^gt, with a loss between R(gamma) and R(gamma^gt), where R(gamma) is the rotation matrix corresponding to gamma.
2D joints loss on samples. Applying L_NLL alone results in overly uncertain predicted 3D shape and pose distributions (see Section 5.1). To ensure that the predicted distributions match the visual evidence in the input X, we impose a reprojection loss between ground-truth 2D joint coordinates (in the image plane) and predicted 2D joint samples, which are obtained by differentiably sampling 3D bodies from the predicted distributions and projecting to 2D using the predicted camera c. Ground-truth 2D joints are computed from the ground-truth SMPL body during synthetic training data generation (see Section 4).
We adapt the rejection sampler presented in [20] to sample from a matrix-Fisher distribution M(R; F), modifying it to allow for backpropagation of gradients through the proposal sampling step (lines 5-7 in Algorithm 1). We refer the reader to [20] for further details about the rejection sampler. In short, to simulate a matrix-Fisher distribution with parameter F, we sample unit quaternions from a Bingham distribution [34] over the unit 3-sphere S^3, with Bingham parameter A computed from F, and then convert the sampled quaternions into rotation matrices [20, 34] with the desired matrix-Fisher distribution. Rejection sampling is used to sample from the Bingham distribution, which has pdf p(x; A) ∝ exp(x^T A x) for x ∈ S^3. The proposal distribution for the rejection sampler is an angular central Gaussian (ACG) distribution, with pdf p(x; Omega) ∝ (x^T Omega x)^{-2}. The ACG distribution is easily simulated [20] by sampling from a zero-mean Gaussian distribution with covariance matrix Omega^{-1} and normalising to unit-length (lines 5-7 in Algorithm 1). The reparameterisation trick [22] is used to differentiably sample from this zero-mean Gaussian, thus allowing for backpropagation of gradients through the rejection sampler.
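The differentiable proposal step (lines 5-7 of Algorithm 1) can be sketched as follows; this numpy version shows the sampling arithmetic only, and the reparameterisation-trick gradients would be obtained by running the same operations in an autodiff framework:

```python
import numpy as np

def sample_acg(Omega, n, rng):
    """ACG proposal draws for the Bingham rejection sampler: sample
    y ~ N(0, Omega^{-1}) and normalise to the unit 3-sphere. Sampling the
    standard-normal eps first is the reparameterisation trick: in an
    autodiff framework, gradients then flow back to Omega."""
    L = np.linalg.cholesky(np.linalg.inv(Omega))  # Omega^{-1} = L L^T
    eps = rng.standard_normal((n, 4))             # reparameterised noise
    y = eps @ L.T                                 # y ~ N(0, Omega^{-1})
    return y / np.linalg.norm(y, axis=1, keepdims=True)
```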
Algorithm 1 samples sets of relative 3D joint rotation matrices {R_i}_{i=1}^{23} from the corresponding distributions M(R_i; F_i). Furthermore, we differentiably sample SMPL shape vectors beta from the predicted Gaussian distribution (Equation 4), again using the reparameterisation trick [22].
The body shape and 3D joint rotation samples are converted into 2D joint samples using the SMPL model and weak-perspective camera parameters c = [s, t]:

x_2D = s Pi(J_3D) + t     (9)

where Pi is an orthographic projection and J_3D denotes the sampled 3D joint locations. The reprojection loss applied between the predicted 2D joint samples and the visible target 2D joint coordinates is given by

L_2D-samples = (1/N) sum_{n=1}^{N} sum_j v_j || x_2D,j^(n) - x_2D,j^gt ||^2     (10)

where the visibilities of the target joints are denoted by v_j (1 if visible, 0 otherwise).
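A minimal numpy sketch of the visibility-masked reprojection loss on samples (the averaging convention here, over samples and visible joints, is an assumption about the exact normalisation):

```python
import numpy as np

def visible_2d_joints_loss(pred_samples, target, vis):
    """Visibility-masked squared 2D joint error (Equation 10 sketch).
    pred_samples: (N, J, 2) sampled 2D joints, target: (J, 2) ground-truth
    2D joints, vis: (J,) visibility mask (1 = visible, 0 = occluded)."""
    sq_err = np.sum((pred_samples - target[None]) ** 2, axis=-1)  # (N, J)
    return np.sum(sq_err * vis[None]) / (np.sum(vis) * pred_samples.shape[0])
```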
4 Implementation Details
Synthetic training data. To train our 3D body shape and pose distribution prediction networks, we require a training dataset of proxy representation inputs paired with body shape and pose labels. We extend the synthetic training frameworks presented in [47, 48], which involve generating inputs and corresponding SMPL body shape and pose (i.e. 3D joint rotation) labels randomly and on-the-fly during training. In brief, for every training iteration, SMPL shapes are randomly sampled from a prior Gaussian distribution while relative 3D joint rotations and global rotation are chosen from the training sets of UP-3D [29], 3DPW [58] or Human3.6M [16]. These are converted into training inputs and ground-truth 2D joint coordinates using the SMPL model and a lightweight renderer [43]. Cropping, occlusion and noise augmentations are then applied to the synthetic inputs.
Previous synthetic training frameworks [47, 48, 53] often use silhouette-based training inputs. This necessitates accurate human silhouette segmentation at test-time, which may be challenging to do robustly. In contrast, our input representations consist of edge-images concatenated with 2D joint heatmaps. To generate edge-images, we first create synthetic RGB images by rendering textured SMPL meshes. For each training mesh, clothing textures are randomly chosen from [57, 2]. The textured SMPL mesh is rendered onto a background image (randomly chosen from LSUN [62]), using randomly-sampled lighting and camera parameters. Canny edge detection [6] is used to compute edge-images from the synthetic RGB images. We show in Section 5.1 that, despite the lack of photorealism in the synthetic RGB images, edge-filtering bridges the synthetic-to-real domain gap at test-time, and performs better than either silhouette-based or synthetic-RGB-based training inputs in our experiments. Examples of synthetic training samples are given in the supplementary material.
Input Type | Architecture | 2D Samples Loss | Synthetic: MPJPE-SC | Synthetic: PVE-T-SC | Synthetic: 2D Joint Err. (Mode/Samples) | SSP-3D: PVE-T-SC | SSP-3D: 2D Joint Err. (Mode/Samples) | 3DPW: MPJPE-SC
Silh. + J2D Hmap | Independent | No | 84.9 | 12.8 | 7.2 / 11.6 | 14.3 | 6.0 / 11.9 | 93.0
RGB + J2D Hmap | Independent | No | 79.9 | 11.3 | 7.1 / 11.7 | 14.0 | 5.9 / 12.0 | 92.8
Edge + J2D Hmap | Independent | No | 85.8 | 12.9 | 7.5 / 12.0 | 13.7 | 5.9 / 11.8 | 88.4
Edge + J2D Hmap | Independent | Yes | 86.3 | 13.2 | 7.6 / 8.9 | 13.9 | 6.2 / 9.6 | 91.3
Edge + J2D Hmap | Hierarchical | No | 84.4 | 12.8 | 7.3 / 10.4 | 13.6 | 5.3 / 11.2 | 87.7
Edge + J2D Hmap | Hierarchical | Yes | 79.1 | 12.6 | 6.7 / 6.9 | 13.6 | 4.8 / 6.9 | 84.7
Training details. We use Adam [23] with a learning rate of 0.0001, a batch size of 80, and train for 150 epochs. For stability, the 2D joints reprojection loss is only applied to the mode pose and shape (projected to 2D) in the first 50 epochs and not to the samples, which are supervised in the next 100 epochs. To boost 3D pose metrics, an MSE loss on the mode 3D joint locations is applied in the final 50 epochs.
Evaluation datasets. 3DPW [58] is used to evaluate 3D pose prediction accuracy. We report mean-per-joint-position-error after scale correction (MPJPE-SC) [47] and after Procrustes analysis (MPJPE-PA), both in mm. Both metrics are computed using the mode 3D joint coordinates of the predicted shape and pose distributions.

SSP-3D is primarily used to evaluate 3D body shape prediction accuracy, using per-vertex Euclidean error in a T-pose after scale-correction (PVE-T-SC) [47] in mm, computed with the mode 3D body shape from the predicted shape distribution. We also evaluate 2D joint prediction error (2D Joint Err. Mode/Samples) in pixels, computed using both the mode 3D body and 10 3D bodies randomly sampled from the predicted shape and pose distributions, projected onto the image plane using the camera prediction. 2D joint error is evaluated on visible target 2D joints only.

Finally, we use a synthetic test dataset for our ablation studies investigating different input representations. It consists of 1000 synthetic input-label pairs, generated in the same way as the synthetic training data, with poses sampled from the test set of Human3.6M [16].
5 Experimental Results
This section investigates different input representations and the benefits of the 2D joints samples loss, compares independent and hierarchical distribution predictors, and benchmarks our method against the state-of-the-art.
5.1 Ablation studies
Input proxy representation. Rows 1-3 in Table 1 compare different choices of input proxy representation: binary silhouettes, RGB images and edge-filtered images (each additionally concatenated with 2D joint heatmaps). The independent network architecture is used for all three input types. To investigate the synthetic-to-real domain gap, metrics are presented for synthetic test data, as well as real test images from SSP-3D and 3DPW. For the latter, silhouette segmentation is carried out with DensePose [15]. Using RGB-based input representations (row 2) results in the best 3D shape and pose metrics on synthetic data, which is reasonable since RGB contains more information than both silhouettes and edge-filtered images. However, metrics are significantly worse on real datasets, suggesting that the network has overfitted to unrealistic artefacts present in low-fidelity (i.e. computationally cheap) synthetic RGB images. Silhouette-based input representations (row 1) also demonstrate a deterioration of 3D metrics on real test data compared to synthetic data, since they are heavily reliant upon accurate silhouettes, which are difficult to robustly segment in test images containing challenging poses or severe occlusions. Inaccurate silhouette segmentations critically impair the network's ability to predict 3D body pose and shape. In contrast, edge-filtering is a simpler and more robust operation than segmentation, but is still able to retain important shape information from the RGB image. Thus, edge-images (concatenated with 2D joint heatmaps) can better bridge the synthetic-to-real domain gap, resulting in improved metrics on real test inputs (row 3).
Hierarchical architecture and reprojection loss on 2D joints samples. Figure 3 and rows 3-6 in Table 1 compare the independent and hierarchical distribution prediction architectures presented in Section 3.4, both with and without the reprojection loss on sampled 2D joints from Section 3.5. When the sample reprojection loss is not applied, the shape and pose distributions predicted by both the independent and hierarchical network architectures do not consistently match the input image, as evidenced by the significant gap between the visible 2D joint error computed using the distributions' modes versus samples drawn from the distributions (in rows 3 and 5 of Table 1) on both synthetic test data and SSP-3D [47]. This implies that the predicted distributions are overly uncertain about parts of the subject's body that are visible and unambiguous in the input image. The visualisations corresponding to the hierarchical architecture trained without the sample reprojection loss in Figure 3 (centre) further demonstrate that the predicted samples often do not match the input image, particularly at the extreme ends of the body. This results in significant undesirable per-vertex uncertainty over unambiguous body parts.
Applying the sample reprojection loss to the independent network partially alleviates the mismatch between inputs and predicted samples, as shown by Figure 3 (right) and row 4 in Table 1, where the mode versus sample 2D joint error gap has reduced. However, training with this loss deteriorates the independent architecture's mode pose prediction metrics (MPJPE-SC and 2D Joint Err. Mode in row 3 vs 4 of Table 1) on both synthetic and real test data. This is because the independent formulation naively models each joint's relative rotation independently of its parents' rotations (Equation 5); however, to predict realistic human pose samples that match the visible input, each joint's rotation distribution must be informed by its parents. The sample reprojection loss attempts to force predicted samples to match the input despite this logical inconsistency, which causes a trade-off between mode and sample pose prediction metrics, particularly worsening MPJPE-SC.
In contrast, applying the sample reprojection loss to the hierarchical network improves metrics corresponding to both mode and sample predictions, as shown by row 6 in Table 1. Now, each SMPL joint's relative rotation distribution is conditioned on all its parents' distributions (Equation 6). Thus, the hierarchical architecture and the sample reprojection loss work in conjunction in enabling predicted hierarchical distributions (and samples) to match the visible input, while yielding improved 3D metrics. Figure 3 (left) exhibits such visually-consistent samples and demonstrates greater prediction uncertainty for ambiguous parts. Note that uncertainty can arise even without occlusion in a monocular setting, e.g. due to depth ambiguities [50, 52], as shown by the left arm samples in the last row of Figure 3. Further visual results are in the supplementary material.
Method | 3DPW: MPJPE | 3DPW: MPJPE-SC | 3DPW: MPJPE-PA
HMR [19] | 130.0 | 102.8 | 76.7
GraphCMR [26] | 119.9 | 102.0 | 70.2
SPIN [25] | 96.9 | 89.4 | 59.0
Pose2Mesh [8] | 89.2 | - | 58.9
I2L-MeshNet [36] | 93.2 | 77.5 | 57.7
Biggs et al. [3] | 93.8 | - | 59.9
DaNet [65] | 85.5 | 76.4 | 54.8
HybrIK [32] | 80.0 | - | 48.8
HMR (unpaired) [19] | - | 126.3 | 92.0
Kundu et al. [28] | 153.4 | - | 89.8
STRAPS [47] | - | 99.0 | 66.8
Sengupta et al. [48] | - | 90.9 | 61.0
Ours w. Detectron2 [60] | 96.2 | 84.7 | 59.2
Ours w. HRNet-W48 [54] | 84.9 | 73.0 | 53.6
Max. input set size | Method | SSP-3D: PVE-T-SC
1 | HMR [19] | 22.9
1 | GraphCMR [26] | 19.5
1 | SPIN [25] | 22.2
1 | DaNet [65] | 22.1
1 | STRAPS [47] | 15.9
1 | Sengupta et al. [48] | 15.2
1 | Ours | 13.6
5 | HMR [19] + Mean | 22.9
5 | GraphCMR [26] + Mean | 19.3
5 | SPIN [25] + Mean | 21.9
5 | DaNet [65] + Mean | 22.1
5 | STRAPS [47] + Mean | 14.4
5 | Sengupta et al. [48] + Mean | 13.6
5 | Sengupta et al. [48] + Prob. Comb. | 13.3
5 | Ours + Mean | 12.2
5 | Ours + Prob. Comb. | 12.0
5.2 Comparison with the stateoftheart
Shape prediction. Table 3 evaluates 3D body shape metrics on SSP-3D [47] for single-image inputs and multi-image input sets, which we evaluate using both the mean and probabilistic combination methods from [48]. Our network surpasses the state-of-the-art [48], mainly due to our use of an edge-based proxy representation, instead of the silhouette-based representations used in [47] and [48]. These methods rely on accurate human silhouettes, which may be difficult to compute at test-time, as discussed in Section 5.1, while our method does not have such dependencies. However, our method may result in erroneous shape predictions when the subject is wearing loose clothing which obscures body shape, in which case the shape prediction overestimates the subject's true proportions (see rows 1-2 in Figure 3).
Pose prediction. Table 2 evaluates 3D pose metrics on 3DPW [58]. Our method is competitive with the state-of-the-art and surpasses other methods that do not require 3D-labelled training images [47, 48, 28, 19]. Figure 4(a) shows that our method performs well for most test examples in 3DPW, even matching pose-focused approaches that do not attempt to accurately predict diverse body shapes [36, 25]. However, some images in 3DPW contain significant occlusion, which can lead to noisy 2D joint heatmaps in the proxy representations, resulting in poor 3D pose metrics, as shown by the right end of the curve in Figure 4(a).
Further quantitative comparison with other shape and pose distribution/multi-hypothesis prediction approaches is given in the supplementary material.
6 Conclusion
In this paper, we have proposed a probabilistic approach to the ill-posed problem of monocular 3D human shape and pose estimation, motivated by the fact that multiple 3D bodies may explain a given 2D image. Our method predicts a novel hierarchical matrix-Fisher distribution over relative 3D joint rotations and a Gaussian distribution over SMPL [33] shape parameters, from which we can sample any number of plausible 3D reconstructions. To ensure that the predicted distributions match the input image, we have implemented a differentiable rejection sampler to impose a loss between predicted 2D joint samples and ground-truth 2D joint coordinates. Our method is competitive with the state-of-the-art in terms of pose metrics on 3DPW, while surpassing the state-of-the-art for shape accuracy on SSP-3D.
Acknowledgements. We thank Dr. Yu Chen (Metail), Mr. Jim Downing (Metail), Dr. David Bruner (SizeStream) and Dr. Delman Lee (TAL Apparel) for providing body shape evaluation data and supporting this research.
Supplementary Material: Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild
Section 7 in this supplementary material contains implementation details, particularly regarding synthetic training data generation and per-vertex uncertainty visualisation. Section 8 discusses qualitative results on the SSP-3D [47] and 3DPW [58] datasets, and compares distribution predictions on images with versus without artificial occlusions. Table 5 compares several recent multi-hypothesis 3D human shape and pose estimation approaches.
7 Implementation Details
7.1 Synthetic Training Data
Our shape and pose distribution prediction neural networks are trained using synthetic training data, consisting of edge-and-joint-heatmap inputs paired with ground-truth SMPL [33] shape and pose parameters. Inputs are rendered on-the-fly during model training using randomly sampled camera extrinsics, lighting, backgrounds and clothing textures. Examples of synthetic training and validation data are given in Figure 5. Note how each body pose may be paired with a different body shape, clothing, camera and background, as well as occlusion and noise augmentations. Thus, we are able to render highly diverse training data on-the-fly during training, enabling the network to see a new pose/shape/clothing/camera/background combination in each training iteration.
Our synthetic RGB images (Figure 5) are computationally cheap but clearly far from photorealistic, resulting in a large synthetic-to-real domain gap. However, simple edge detection [6] is able to significantly reduce this gap [7], motivating the use of edge-filtered images as part of our input proxy representation. We found that noisy edge detections (as seen in Figure 5) retained sufficient visual shape and pose information, and efforts to produce clean edge-images (e.g. hysteresis-based edge tracking or further hyperparameter tuning) did not improve performance.
The required body shapes, poses, clothing and backgrounds are obtained as follows. For training, ground-truth SMPL 3D joint rotation matrices are sampled from the training splits of 3DPW [58] and UP-3D [29], as well as Human3.6M [16] subjects 1, 5, 6, 7 and 8, giving a total of 91,106 training poses. Validation poses are sampled from the 3DPW/UP-3D validation splits and Human3.6M subjects 9 and 11, resulting in 33,347 validation poses. SMPL body shape parameters are randomly sampled from N(0, 1.25²I), following [47]. RGB clothing textures for the SMPL body mesh are selected from SURREAL [57] and Multi-Garment Net [2], resulting in 917 training textures and 108 validation textures. Backgrounds are obtained from LSUN [62], which contains a collection of diverse indoor and outdoor scenes. We sample from 397,582 different training backgrounds and 3,000 different validation backgrounds. Note that background training images may contain other humans; this is intentional and essential for robustness against test images with multiple people. The network learns to focus on the person corresponding to the input joint heatmaps and ignore persons in the background.
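The shape sampling above (zero mean, std. 1.25, per Table 4) can be written directly; the number of shape coefficients (10) is an assumption based on standard SMPL usage.

```python
import numpy as np

def sample_shape(rng, num_betas=10, std=1.25):
    """Draw SMPL shape coefficients from a zero-mean isotropic Gaussian
    with the standard deviation listed in Table 4."""
    return rng.normal(0.0, std, size=num_betas)

rng = np.random.default_rng(0)
betas = sample_shape(rng)   # one shape vector per rendered training example
```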
Textured SMPL meshes are rendered with PyTorch3D [43], using a perspective camera model and Phong shading. Camera and lighting parameters are randomly sampled, with sampling hyperparameters given in Table 4. Generated images are cropped around the rendered body using a square bounding box, whose size is randomly scaled by a factor in the range (0.8, 1.2).
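The random square-cropping step might look like the following sketch. The centre and body-size arguments are assumed inputs from the renderer, and border clipping/padding is omitted for brevity.

```python
import numpy as np

def square_crop(image, centre, body_size, rng):
    """Crop a square around the rendered body, scaling the box size by a
    random factor in (0.8, 1.2) as described above."""
    scale = rng.uniform(0.8, 1.2)
    half = int(0.5 * scale * body_size)
    cy, cx = centre
    return image[cy - half:cy + half, cx - half:cx + half]

rng = np.random.default_rng(0)
img = np.zeros((512, 512, 3))
crop = square_crop(img, centre=(256, 256), body_size=300, rng=rng)
```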
To further bridge the synthetic-to-real gap, we implement random occlusion, body part removal, 2D joint removal and 2D joint noise augmentations during training. The hyperparameters associated with these data augmentations are given in Table 6.
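Two of the augmentations above can be sketched as follows, using the probabilities and ranges listed in Table 6. Function names and the exact application order are illustrative assumptions.

```python
import numpy as np

def augment_joints(joints2d, rng, remove_prob=0.1, noise_range=8):
    """Add uniform pixel noise to 2D joints and randomly mark joints as
    removed, with the hyperparameters listed in Table 6."""
    noisy = joints2d + rng.uniform(-noise_range, noise_range, joints2d.shape)
    visible = rng.random(len(joints2d)) >= remove_prob  # False = removed
    return noisy, visible

def occlusion_box(image, rng, prob=0.5, size=48):
    """Black out a random square region with probability 0.5 (Table 6)."""
    img = image.copy()
    if rng.random() < prob:
        h, w = img.shape[:2]
        y = rng.integers(0, h - size)
        x = rng.integers(0, w - size)
        img[y:y + size, x:x + size] = 0
    return img

rng = np.random.default_rng(0)
joints = np.array([[128.0, 64.0], [130.0, 200.0]])
noisy, visible = augment_joints(joints, rng)
occluded = occlusion_box(np.ones((256, 256)), rng)
```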
Table 4: Synthetic training data sampling hyperparameters.

| Hyperparameter | Value |
| --- | --- |
| Shape parameter sampling mean | 0 |
| Shape parameter sampling std. | 1.25 |
| Cam. translation sampling mean | (0, 0.2, 2.5) m |
| Cam. translation sampling var. | (0.05, 0.05, 0.25) m |
| Cam. focal length | 300.0 |
| Lighting ambient intensity range | [0.4, 0.8] |
| Lighting diffuse intensity range | [0.4, 0.8] |
| Lighting specular intensity range | [0.0, 0.5] |
| Bounding box scale factor range | [0.8, 1.2] |
| Proxy representation dimensions | pixels |
7.2 Visualisation of Per-Vertex Uncertainty
Figures 6, 7 and 8 in this supplementary material, as well as several figures in the main manuscript, visualise per-vertex 3D location uncertainties corresponding to the predicted shape and 3D joint rotation distributions. These are computed by: i) sampling 100 shape parameter vectors and relative 3D joint rotations (for the entire kinematic tree) from the predicted distributions; ii) passing each of these samples through the SMPL function [33] to obtain the corresponding body meshes; iii) computing the mean location of each vertex over all the samples; and iv) determining the average Euclidean distance from the sample mean for each vertex over all the samples. This distance is visualised in the vertex scatter plots as a measure of per-vertex 3D location uncertainty.
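Steps iii) and iv) above reduce to a few array operations. The sketch below assumes the sampled meshes are already stacked into an array of shape (num_samples, num_vertices, 3); the toy data stands in for real SMPL outputs.

```python
import numpy as np

def per_vertex_uncertainty(sampled_meshes):
    """Average Euclidean distance of each vertex from its sample mean,
    over an array of sampled meshes with shape (N, V, 3)."""
    mean = sampled_meshes.mean(axis=0)                      # (V, 3) sample mean
    dists = np.linalg.norm(sampled_meshes - mean, axis=-1)  # (N, V) distances
    return dists.mean(axis=0)                               # (V,) uncertainty

# Toy stand-in for 100 sampled SMPL meshes with 6890 vertices each.
rng = np.random.default_rng(0)
meshes = rng.normal(0.0, 0.01, size=(100, 6890, 3))
uncertainty = per_vertex_uncertainty(meshes)
```

The resulting per-vertex scalar is what the scatter plots colour-code.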
Table 5: Comparison of recent multi-hypothesis 3D human shape and pose estimation approaches. MPJPE and MPJPE-PA are reported on 3DPW; PVE-T-SC is reported on SSP-3D. Each metric is given for 1, 5, 10 and 25 samples.

| Method | MPJPE (1) | MPJPE (5) | MPJPE (10) | MPJPE (25) | MPJPE-PA (1) | MPJPE-PA (5) | MPJPE-PA (10) | MPJPE-PA (25) | PVE-T-SC (1) | PVE-T-SC (5) | PVE-T-SC (10) | PVE-T-SC (25) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Biggs [3] | 93.8 | 82.2 | 79.4 | 75.8 | 59.9 | 57.1 | 56.6 | 55.6 | – | – | – | – |
| Sengupta [48] | 97.1 | 95.8 | 93.1 | 89.7 | 61.1 | 59.4 | 58.2 | 56.5 | 15.2 | 14.8 | 13.6 | 11.9 |
| ProHMR [27] | – | – | – | – | 59.8 | 56.5 | 54.6 | 52.4 | – | – | – | – |
| Ours (Independent) w. HRNet [54] | 88.3 | 85.0 | 82.6 | 78.5 | 56.6 | 54.5 | 52.8 | 50.2 | 13.9 | 12.9 | 12.0 | 10.3 |
| Ours (Hierarchical) w. HRNet [54] | 84.9 | 81.6 | 79.0 | 75.1 | 53.6 | 51.4 | 49.6 | 47.0 | 13.6 | 12.3 | 11.3 | 9.8 |
Table 6: Data augmentation hyperparameters.

| Augmentation | Hyperparameter | Value |
| --- | --- | --- |
| Body part occlusion | Occlusion probability | 0.1 |
| 2D joints L/R swap | Swap probability | 0.1 |
| Half-image occlusion | Occlusion probability | 0.05 |
| 2D joints removal | Removal probability | 0.1 |
| 2D joints noise | Noise range | [-8, 8] pixels |
| Occlusion box | Probability, size | 0.5, 48 pixels |
8 Qualitative Results
Figure 7 presents results on artificially occluded images from SSP-3D [47]. In particular, note that i) occluded/invisible body parts result in increased 3D location uncertainty for the corresponding vertices, and ii) 3D body samples from the predicted distributions match the visible body parts in the 2D image, while samples for invisible body parts are more diverse. However, occluded sample diversity is still somewhat limited and samples tend to be clustered around the mode predictions, which is a weakness of our method. This may be alleviated by predicting multi-modal distributions over 3D shape and pose in future work. Figure 7 also illustrates our method's ability to predict a range of body shapes, owing to the synthetic training framework used.
Figure 6 presents results on the test split of 3DPW [58]. Again, note the increased uncertainty and sample diversity for occluded and out-of-frame body parts, and the reprojection consistency between predicted samples and the visible bodies in the images. Results on 3DPW highlight another key challenge for future work: when faced with baggy or loose clothing, our method tends to over-estimate the subject's body proportions. This is because our synthetic training data does not model the shape of clothing on the human body surface, only its texture. Future work could focus on using synthetic clothed humans for training.
Figure 8 compares shape and pose distribution predictions on images from SSP-3D with versus without artificial occlusions, further corroborating that ambiguous parts result in greater uncertainty and more diverse 3D samples. However, it is again apparent that sample diversity for highly ambiguous parts is more limited than expected, as samples tend to be closely clustered around the mode prediction.
References
[1] (2005) SCAPE: shape completion and animation of people. ACM Transactions on Graphics (TOG), Proceedings of SIGGRAPH, Vol. 24, pp. 408–416.
[2] (2019) Multi-Garment Net: learning to dress 3D people from images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
[3] (2020) 3D multi-bodies: fitting sets of plausible 3D models to ambiguous image data. In Advances in Neural Information Processing Systems (NeurIPS).
[4] (1994) Mixture density networks. Technical report.
[5] (2016) Keep it SMPL: automatic estimation of 3D human pose and shape from a single image. In Proceedings of the European Conference on Computer Vision (ECCV).
[6] (1986) A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 8 (6), pp. 679–698.
[7] (2020) Real-time screen reading: reducing domain shift for one-shot learning. In Proceedings of the British Machine Vision Conference (BMVC).
[8] (2020) Pose2Mesh: graph convolutional network for 3D human pose and mesh recovery from a 2D human pose. In Proceedings of the European Conference on Computer Vision (ECCV).
[9] (2001) People tracking using hybrid Monte Carlo filtering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Vol. 2, pp. 321–328.

[10] (2000) Articulated body motion capture by annealed particle filtering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] (1972) Orientation statistics. Biometrika 59 (3), pp. 665–676.
[12] (2020) Hierarchical kinematic human mesh recovery. In Proceedings of the European Conference on Computer Vision (ECCV).
[13] (2020) Deep orientation uncertainty learning based on a Bingham loss. In International Conference on Learning Representations (ICLR).
[14] (2019) HoloPose: holistic 3D human reconstruction in-the-wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] (2018) DensePose: dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[16] (2014) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 36 (7), pp. 1325–1339.
[17] (2017) Generating multiple diverse hypotheses for human 3D pose consistent with 2D joint detections. In IEEE International Conference on Computer Vision (ICCV) Workshops (PeopleCap).
[18] (2018) Total Capture: a 3D deformation model for tracking faces, hands, and bodies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] (2018) End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[20] (2013) A new method to simulate the Bingham and related distributions in directional data analysis with applications. arXiv:1310.8110.
[21] (1977) The von Mises-Fisher matrix distribution in orientation statistics. Journal of the Royal Statistical Society, Series B (Methodological) 39 (1), pp. 95–106.
[22] (2014) Auto-encoding variational Bayes. arXiv:1312.6114.
[23] (2014) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
[24] (2020) PointRend: image segmentation as rendering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[25] (2019) Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
[26] (2019) Convolutional mesh regression for single-image human shape reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[27] (2021) Probabilistic modeling for human mesh recovery. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
[28] (2020) Appearance consensus driven self-supervised human mesh recovery. In Proceedings of the European Conference on Computer Vision (ECCV).
[29] (2017) Unite the People: closing the loop between 3D and 2D human representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[30] (2018) Bayesian attitude estimation with the matrix Fisher distribution on SO(3). IEEE Transactions on Automatic Control 63 (10), pp. 3377–3392.
[31] (2019) Generating multiple hypotheses for 3D human pose estimation with mixture density network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[32] (2021) HybrIK: a hybrid analytical-neural inverse kinematics solution for 3D human pose and shape estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[33] (2015) SMPL: a skinned multi-person linear model. ACM Transactions on Graphics (TOG), Proceedings of ACM SIGGRAPH Asia, Vol. 34, pp. 248:1–248:16.
[34] (2000) Directional statistics. Wiley.
[35] (2020) Probabilistic orientation estimation with matrix Fisher distributions. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33.
[36] (2020) I2L-MeshNet: image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single RGB image. In Proceedings of the European Conference on Computer Vision (ECCV).
[37] (2020) GraphMDN: leveraging graph structure and deep learning to solve inverse problems. arXiv:2010.13668.
[38] (2018) Neural body fitting: unifying deep learning and model-based human pose and shape estimation. In Proceedings of the International Conference on 3D Vision (3DV).
[39] (2019) Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[40] (2019) TexturePose: supervising human mesh estimation with texture consistency. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
[41] (2018) Learning to estimate 3D human pose and shape from a single color image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[42] (2018) Deep directional statistics: pose estimation with uncertainty quantification. In Proceedings of the European Conference on Computer Vision (ECCV).
[43] (2020) Accelerating 3D deep learning with PyTorch3D. arXiv:2007.08501.

[44] (2015) Variational inference with normalizing flows. In Proceedings of the International Conference on Machine Learning (ICML), PMLR Vol. 37, pp. 1530–1538.
[45] (2019) PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
[46] (2020) PIFuHD: multi-level pixel-aligned implicit function for high-resolution 3D human digitization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[47] (2020) Synthetic training for accurate 3D human pose and shape estimation in the wild. In Proceedings of the British Machine Vision Conference (BMVC).
[48] (2021) Probabilistic 3D human shape and pose estimation from multiple unconstrained images in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[49] (2005) Discriminative density propagation for 3D human motion estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 390–397.
[50] (2001) Covariance scaled sampling for monocular 3D body tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[51] (2002) Hyperdynamics importance sampling. In Proceedings of the European Conference on Computer Vision (ECCV).
[52] (2003) Kinematic jump processes for monocular 3D human tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[53] (2019) Towards accurate 3D human body reconstruction from silhouettes. In Proceedings of the International Conference on 3D Vision (3DV).
[54] (2019) Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[55] (2017) Indirect deep structured learning for 3D human shape and pose prediction. In Proceedings of the British Machine Vision Conference (BMVC).
[56] (2018) BodyNet: volumetric inference of 3D human body shapes. In Proceedings of the European Conference on Computer Vision (ECCV).
[57] (2017) Learning from synthetic humans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[58] (2018) Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In Proceedings of the European Conference on Computer Vision (ECCV).
[59] (2021) Probabilistic monocular 3D human pose estimation with normalizing flows. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
[60] (2019) Detectron2. https://github.com/facebookresearch/detectron2.
[61] (2019) DenseRaC: joint 3D pose and shape estimation by dense render-and-compare. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
[62] (2015) LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv:1506.03365.
[63] (2018) Monocular 3D pose and shape estimation of multiple people in natural scenes: the importance of multiple scene constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[64] (2020) 3D human mesh regression with dense correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[65] (2019) DaNet: decompose-and-aggregate network for 3D human shape and pose estimation. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 935–944.