3D human body shape and pose estimation from an RGB image is a challenging computer vision problem, partly due to its under-constrained nature wherein multiple 3D human bodies may explain a given 2D image, especially when the subject is significantly occluded, as is common for in-the-wild images. Several recent works [55, 19, 26, 25, 47, 65, 12, 38, 14, 41, 40, 56, 36, 53] use deep neural networks to regress a single body shape and pose solution, which can result in impressive 3D body reconstructions given sufficient visual evidence in the input image. However, when visual evidence of the subject’s shape and pose is obscured, e.g. due to occluding objects or self-occlusions, a single solution does not fully describe the space of plausible 3D reconstructions. In contrast, we aim to estimate a structured probability distribution over 3D body shape and pose, conditioned on the input image, thereby allowing us to sample any number of plausible 3D reconstructions and quantify prediction uncertainty over the 3D body surface, as shown in Figure 1.
We use the SMPL body model to represent human shape and pose. Identity-dependent body shape is parameterised by coefficients of a PCA basis; hence, a simple multivariate Gaussian distribution over the shape parameters is suitable. Body pose is parameterised by relative 3D joint rotations along the SMPL kinematic tree, which may be represented using rotation matrices. Regressing rotation matrices using neural networks is non-trivial, since they lie in $SO(3)$, a non-linear 3D manifold with a different topology to $\mathbb{R}^{3 \times 3}$ or $\mathbb{R}^9$, the space in which unconstrained neural network outputs lie. However, one can define probability density functions over the Lie group $SO(3)$, such as the matrix-Fisher distribution [34, 11, 21], the parameter of which is an element of $\mathbb{R}^{3 \times 3}$ and may be easily regressed with a neural network. We propose a hierarchical probability distribution over relative 3D joint rotations along the SMPL kinematic tree, wherein the probability density function of each joint’s relative rotation matrix is a matrix-Fisher distribution conditioned on the parents of that joint in the kinematic tree. We train a deep neural network to predict the parameters of such a distribution over body pose, alongside a Gaussian distribution over SMPL shape.
Moreover, to ensure that 3D bodies sampled from the predicted distributions match the 2D input image, we implement a reprojection loss between predicted samples and ground-truth visible 2D joint annotations. To allow for the backpropagation of gradients through the sampling operation, we present a differentiable rejection sampler for matrix-Fisher distributions over relative 3D joint rotations.
Finally, a key obstacle for SMPL body shape regression from in-the-wild images is the lack of training datasets with accurate and diverse body shape labels . To overcome this, we follow [47, 53, 41, 48] and utilise synthetic data, randomly generated on-the-fly during training. Inspired by , we use convolutional edge filters to close the large synthetic-to-real gap and show that using edge-based inputs yields better performance than commonly-used silhouette-based inputs [47, 53, 48, 41], due to improved robustness and capacity to retain visual shape information.
In summary, our main contributions are as follows:
Given an input image, we predict a novel hierarchical matrix-Fisher distribution over relative 3D joint rotation matrices, whose structure is explicitly informed by the SMPL kinematic tree, alongside a Gaussian distribution over SMPL shape parameters.
We present a differentiable rejection sampler to sample any number of plausible 3D reconstructions and quantify prediction uncertainty over the body surface. This enables a reprojection loss between predicted samples and ground-truth coordinates of visible 2D joints, further ensuring that the predicted distributions are consistent with the input image.
2 Related Work
This section reviews approaches to monocular 3D human body shape and pose estimation, as well as deep-learning-based methods for probabilistic rotation estimation.
Monocular 3D shape and pose estimation methods can be classified as optimisation-based or learning-based. Optimisation-based approaches fit a parametric 3D body model [33, 1, 39, 18] to 2D observations, such as 2D keypoints [5, 29], silhouettes or body part segmentations, by optimising a suitable cost function. These methods do not require expensive 3D-labelled training data, but are sensitive to poor initialisations and noisy observations.
Learning-based approaches can be further split into model-free or model-based. Model-free methods use deep networks to directly output human body vertex meshes [26, 36, 65, 64, 8], voxel grids  or implicit surfaces [45, 46] from an input image. In contrast, model-based methods [19, 47, 38, 12, 55, 14, 41, 40, 61] regress 3D body model parameters [39, 33, 18, 1], which give a low-dimensional representation of a 3D human body. To overcome the lack of in-the-wild 3D-labelled training data, several methods [19, 61, 26, 12, 14] use diverse 2D-labelled data as a source of weak supervision.  extends this approach by incorporating optimisation into their model training loop, lifting 2D labels to self-improving 3D labels. These approaches often result in impressive 3D pose predictions, but struggle to accurately predict a diverse range of body shapes, since 2D keypoint supervision only provides a sparse shape signal. Shape prediction accuracy may be improved using synthetic training data [47, 53, 41, 48] consisting of synthetic input proxy representations (PRs) paired with ground-truth body shape and pose. PRs commonly consist of silhouettes and 2D joint heatmaps [47, 41, 48], necessitating accurate silhouette segmentations [24, 15] at test-time, which is not guaranteed for challenging in-the-wild inputs. Other methods  pre-train on synthetic RGB inputs  and then fine-tune on the scarce and limited-shape-diversity real 3D training data available [16, 58], to avoid over-fitting to artefacts in low-fidelity synthetic data. In contrast, we utilise edge-based PRs, hence dropping the reliance on accurate segmentation networks without requiring fine-tuning on real data or high-fidelity synthetic data.
Earlier approaches specified a cost function corresponding to the posterior probability of 3D pose given 2D observations and analysed its multi-modal structure due to ill-posedness. Strategies to sample multiple 3D poses with high posterior probability included cost-covariance-scaled and inverse-kinematics-based global search and local refinement, as well as cost-function-modifying MCMC. Recently, several learning-based methods [49, 31, 17, 59, 37] predict multi-modal distributions over 3D joint locations conditioned on 2D inputs, using Bayesian mixture of experts, mixture density networks [31, 4, 37] or normalising flows [59, 44]. Our method extends beyond 3D joints and predicts distributions over human pose and shape. This has been addressed by Biggs et al., who predict a categorical distribution over a set of SMPL parameter hypotheses. Sengupta et al. estimate an independent Gaussian distribution over both SMPL shape and joint rotation vectors. In contrast, we note that 3D rotations lie in $SO(3)$, motivating our hierarchical matrix-Fisher distribution.
Rotation distribution estimation via deep learning. Prokudin et al. use biternion networks to predict a mixture-of-von-Mises distribution over object pose angle. Gilitschenski et al. use a Bingham distribution over unit quaternions to represent orientation uncertainty. However, these works have to enforce constraints on the parameters of their predicted distributions (e.g. positive semi-definiteness). To overcome this, Mohlin et al. train a deep network to regress a matrix-Fisher distribution [34, 11, 21] over 3D rotation matrices. We adapt this approach to define our hierarchical matrix-Fisher distribution over relative 3D joint rotation matrices.
3 Method
This section provides an overview of the SMPL model and the matrix-Fisher distribution, presents our structured, hierarchical pose and shape distribution estimation architecture and discusses the loss functions used to train it.
3.1 SMPL model
SMPL is a parametric 3D human body model. Identity-dependent body shape is represented by shape parameters $\beta$, which are coefficients of a PCA body shape basis. Body pose is defined by the relative 3D rotations of the bones formed by the 23 body (i.e. non-root) joints in the SMPL kinematic tree. The rotations may be represented using rotation matrices $\{R_i\}_{i=1}^{23}$, where $R_i \in SO(3)$. We parameterise the global rotation (i.e. rotation of the root joint) in axis-angle form by $\gamma \in \mathbb{R}^3$. A differentiable function $\mathcal{S}(\{R_i\}_{i=1}^{23}, \beta, \gamma)$ maps the input pose and shape parameters to an output vertex mesh $V \in \mathbb{R}^{6890 \times 3}$. 3D joint locations, for $L$ joints of interest, are obtained as $J^{3D} = \mathcal{J} V$, where $\mathcal{J} \in \mathbb{R}^{L \times 6890}$ is a linear vertex-to-joint regression matrix.
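3.2 Matrix-Fisher distribution over SO(3)
The matrix-Fisher distribution defines a probability density function over $SO(3)$, given by
$$p(R \mid F) = \frac{1}{c(F)} \exp\left(\operatorname{tr}(F^T R)\right) = \mathcal{M}(R; F) \qquad (1)$$
where $F \in \mathbb{R}^{3 \times 3}$ is the matrix parameter of the distribution, $c(F)$ is the normalising constant and $R \in SO(3)$. We present some key properties of the matrix-Fisher distribution below, but refer the reader to [30, 35] for further details, visualisations and a method for approximating the intractable normalising constant and its gradient w.r.t. $F$.
The properties of $\mathcal{M}(R; F)$ can be described in terms of the singular value decomposition (SVD) of $F$, denoted by $F = U' S' V'^T$, with $S' = \operatorname{diag}(s'_1, s'_2, s'_3)$ and $s'_1 \geq s'_2 \geq s'_3 \geq 0$. $U'$ and $V'$ are orthonormal matrices, but they may have a determinant of -1 and thus are not necessarily elements of $SO(3)$. Therefore, a proper SVD $F = U S V^T$ is used, where
$$U = U' \operatorname{diag}(1, 1, \det(U')), \quad V = V' \operatorname{diag}(1, 1, \det(V')), \quad S = \operatorname{diag}(s_1, s_2, s_3) = \operatorname{diag}(s'_1, s'_2, \det(U'V')\, s'_3) \qquad (2)$$
which ensures that $U, V \in SO(3)$. Then, the mode of the distribution is given by
$$R_{\text{mode}} = \arg\max_{R \in SO(3)} p(R \mid F) = U V^T. \qquad (3)$$
The columns of $U$ define the distribution’s principal axes of rotation (analogous to the principal axes of a multivariate Gaussian distribution), while the proper singular values in $S$ give the concentration of the distribution for rotations about the principal axes. Specifically, the concentration of rotations about the $i$-th principal axis ($i$-th column of $U$) is given by $s_j + s_k$ for $(i, j, k) \in \{(1, 2, 3), (2, 3, 1), (3, 1, 2)\}$. The concentration of the distribution may be different about each principal axis, allowing for axis-dependent rotation uncertainty modelling.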
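The proper SVD, mode and per-axis concentrations described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names `proper_svd` and `mode_and_concentrations` are hypothetical, and the intractable normalising constant is not touched.

```python
import numpy as np

def proper_svd(F):
    """Proper SVD F = U diag(s) V^T with U, V in SO(3) (hypothetical helper)."""
    U_p, s_p, Vt_p = np.linalg.svd(F)      # s_p is non-negative, sorted descending
    det_u = np.linalg.det(U_p)             # +1 or -1 for an orthonormal matrix
    det_v = np.linalg.det(Vt_p)
    U = U_p @ np.diag([1.0, 1.0, det_u])   # flip the last column if det(U') = -1
    V = Vt_p.T @ np.diag([1.0, 1.0, det_v])
    s = s_p * np.array([1.0, 1.0, det_u * det_v])  # last proper singular value may be negative
    return U, s, V

def mode_and_concentrations(F):
    """Mode U V^T and the concentration s_j + s_k about each principal axis."""
    U, s, V = proper_svd(F)
    mode = U @ V.T
    conc = np.array([s[1] + s[2], s[2] + s[0], s[0] + s[1]])
    return mode, conc
```

For example, `F = 5 * R0` for a rotation matrix `R0` gives mode `R0` and an equal concentration of 10 about every principal axis, i.e. an isotropic distribution centred on `R0`.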
3.3 Proxy representation computation
Given an input RGB image $I$, we first compute a proxy representation $X$ (see Figure 2), consisting of an edge-image concatenated with joint heatmaps. Comparisons with silhouette- and RGB-based representations are given in Section 5.1. Edge-images are obtained with Canny edge detection. 2D joint heatmaps are computed using HRNet-W48, and joint predictions with low confidence scores (below a confidence threshold) are thresholded out. The edge-image and joint heatmaps are stacked along the channel dimension to produce $X$. Proxy representations [47, 41] are used to close the domain gap between synthetic training images and real test-time RGB images, since synthetic proxy representations are more similar to their real counterparts than synthetic RGB images are to real RGB images.
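As a rough illustration of the stacking step, the sketch below zeroes out heatmaps of low-confidence joints and concatenates the result with an edge-image along the channel dimension. The function name, the channels-first layout and the default threshold value are assumptions made for this example, not values taken from the text; the edge-image and heatmaps are assumed to be computed elsewhere.

```python
import numpy as np

def build_proxy_representation(edge_image, joint_heatmaps, joint_confidences,
                               conf_threshold=0.5):
    """Stack an (H, W) edge-image with (J, H, W) joint heatmaps into (1 + J, H, W).

    conf_threshold is a placeholder value chosen for illustration only.
    """
    heatmaps = joint_heatmaps.copy()
    # Threshold out joints whose detector confidence is too low.
    heatmaps[joint_confidences < conf_threshold] = 0.0
    return np.concatenate([edge_image[None], heatmaps], axis=0)
```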
3.4 Body shape and pose distribution prediction
Our goal is to predict a probability distribution over relative 3D joint rotations $\{R_i\}_{i=1}^{23}$ and SMPL shape parameters $\beta$ conditioned upon a given input proxy representation $X$. We also predict deterministic estimates of the global body rotation $\gamma$ and weak-perspective camera parameters $c = [s, t_x, t_y]$, representing scale and translation.
Since $\beta$ represents the linear coefficients of a PCA shape-space, a Gaussian distribution with a diagonal covariance matrix is suitable,
$$p(\beta \mid X) = \mathcal{N}\big(\beta;\, \mu_\beta(X),\, \operatorname{diag}(\sigma^2_\beta(X))\big) \qquad (4)$$
where the mean $\mu_\beta$ and variances $\sigma^2_\beta$ are functions of $X$.
The matrix-Fisher distribution (Equation 1) may be naively used to define a distribution over 3D joint rotations
$$p(R_i \mid X) = \mathcal{M}\big(R_i;\, F_i(X)\big) \qquad (5)$$
for $i \in \{1, \dots, 23\}$. Here, each joint is modelled independently of all the other joints. Thus, the matrix parameter of the $i$-th joint, $F_i$, is a function of the input $X$ only.
To predict the parameters of this naive, independent distribution over 3D joint rotations, in addition to the shape distribution parameters, global body rotation and weak-perspective camera, we learn a function $f_{\text{indep}}$ mapping the input $X$ to the set of desired outputs $\{\{F_i\}_{i=1}^{23}, \mu_\beta, \sigma^2_\beta, \gamma, c\}$, where $f_{\text{indep}}$ is represented by a deep neural network with weights $W_{\text{indep}}$.
However, the independent matrix-Fisher distribution in Equation 5 does not model SMPL 3D joint rotations faithfully, since the rotation of each part/bone is defined relative to its parent joint in the SMPL kinematic tree. Hence, a distribution over the $i$-th rotation matrix $R_i$ conditioned on the input $X$ should be informed by the distributions over all its parent joints $A(i)$, as well as the global body rotation $\gamma$, to enable the distribution to match the 2D visual pose evidence present in $X$. Furthermore, 3D joints in the SMPL rest-pose skeleton are dependent upon the shape parameters $\beta$, while the mapping from 3D to the 2D image plane is given by the camera model. Hence, a distribution over $R_i$ given $X$ should also consider the predicted shape mean $\mu_\beta$ and variance $\sigma^2_\beta$, as well as the predicted camera $c$. This is similar to the rationale behind the deterministic iterative/hierarchical predictors in [19, 12], except we model these relationships in a probabilistic sense, by defining
$$p\big(R_i \mid X, \{R_j\}_{j \in A(i)}\big) = \mathcal{M}(R_i; F_i), \quad F_i = F_i\big(X, \{U_j, S_j, U_j V_j^T\}_{j \in A(i)}, \mu_\beta, \sigma^2_\beta, \gamma, c\big) \qquad (6)$$
for $i \in \{1, \dots, 23\}$. Now, the matrix parameter $F_i$ of the $i$-th joint is a function of all its parent distributions, represented by the principal axes $U_j$, singular values $S_j$ and modes $U_j V_j^T$ for $j \in A(i)$, as well as the shape distribution parameters $\mu_\beta$ and $\sigma^2_\beta$, global rotation $\gamma$, camera parameters $c$ and the input $X$. Note that the parent distributions are themselves functions of their respective parent joints, while $\mu_\beta$, $\sigma^2_\beta$, $\gamma$ and $c$ are all functions of $X$.
To predict the parameters of the hierarchical matrix-Fisher distribution in Equation 6, we propose a hierarchical neural network architecture $f_{\text{hier}}$, with weights $W_{\text{hier}}$ (Figure 2). When considered as a black-box, $f_{\text{hier}}$ yields the same set of outputs as the independent predictor $f_{\text{indep}}$. However, $f_{\text{hier}}$ utilises the iterative hierarchical architecture presented in Figure 2, which amounts to multiple streams of fully-connected layers, each following one “limb” of the kinematic tree. In contrast, $f_{\text{indep}}$ predicts pose similarly to shape, camera and global rotation parameters, using a single stream of fully-connected layers. We compare the naive independent formulation with the hierarchical formulation in Section 5.1.
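To make the notion of "all parents of a joint" concrete, the sketch below walks up a kinematic tree to collect the ancestors of each body joint. The parent list is the commonly used 24-joint SMPL ordering (root at index 0) and should be treated as an assumption here; the helper name `ancestors` is hypothetical.

```python
# Parent index of each joint in the commonly used 24-joint SMPL ordering
# (assumed here; index 0 is the root, whose rotation is the global rotation).
SMPL_PARENTS = [-1, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8,
                9, 9, 9, 12, 13, 14, 16, 17, 18, 19, 20, 21]

def ancestors(joint, parents=SMPL_PARENTS):
    """Body-joint ancestors of `joint`, ordered from the root side down to its parent."""
    chain = []
    p = parents[joint]
    while p > 0:  # stop at the root, which is handled separately as the global rotation
        chain.append(p)
        p = parents[p]
    return chain[::-1]

# Every parent index precedes its child, so visiting joints in index order
# guarantees that all ancestor distributions exist before a joint is processed.
for i in range(1, len(SMPL_PARENTS)):
    assert SMPL_PARENTS[i] < i
```

Under this ordering, the left wrist (index 20) depends on the chain spine, collar, shoulder and elbow, i.e. one "limb" of the tree.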
3.5 Loss functions
Distribution prediction networks are trained with a synthetic dataset (Section 4).
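Negative log-likelihood (NLL) loss on distribution parameters. The NLL corresponding to the Gaussian body shape distribution (Equation 4) is given, up to an additive constant, by:
$$\mathcal{L}_{\text{shape-NLL}} = \frac{1}{2} \sum_{n} \left( \log \sigma^2_{\beta_n} + \frac{(\beta_n - \mu_{\beta_n})^2}{\sigma^2_{\beta_n}} \right)$$
where $n$ indexes the shape coefficients. The NLL corresponding to the matrix-Fisher distribution over relative 3D joint rotations is defined as:
$$\mathcal{L}^{i}_{\text{pose-NLL}} = \log c(F_i) - \operatorname{tr}(F_i^T R_i)$$
for $i \in \{1, \dots, 23\}$, where $F_i$ may be obtained via the independent or hierarchical matrix-Fisher models presented above. Intuitively, the trace term pushes the predicted distribution mode (Equation 3) towards the target $R_i$, while the log normalising constant acts as a regulariser, preventing the singular values of $F_i$ from getting too large. All predicted distribution parameters are dependent on the model weights, $W_{\text{indep}}$ or $W_{\text{hier}}$, which are learnt in a maximum likelihood framework aiming to minimise the joint shape and pose NLL: $\mathcal{L}_{\text{NLL}} = \mathcal{L}_{\text{shape-NLL}} + \sum_{i=1}^{23} \mathcal{L}^{i}_{\text{pose-NLL}}$.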
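A minimal NumPy sketch of the two NLL ingredients is given below, assuming a diagonal-Gaussian shape NLL written up to an additive constant. Only the trace (data) term of the matrix-Fisher NLL is shown; the log normalising constant requires the approximation referred to in the text and is deliberately omitted. Function names are hypothetical.

```python
import numpy as np

def gaussian_shape_nll(beta_target, mu, var):
    """Diagonal-Gaussian NLL of the target shape, dropping the constant 0.5*log(2*pi) terms."""
    return 0.5 * np.sum(np.log(var) + (beta_target - mu) ** 2 / var)

def matrix_fisher_trace_term(F, R_target):
    """Data term -tr(F^T R) of the matrix-Fisher NLL (log c(F) is NOT included)."""
    return -np.trace(F.T @ R_target)
```

The trace term alone is unbounded below as the singular values of `F` grow, which is why the omitted log normalising constant is needed as a regulariser in an actual training loss.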
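Loss on global body rotation. We predict deterministic estimates of the global body rotation vectors $\gamma$, which are supervised using ground-truth global rotations $\gamma^{\text{target}}$, with loss $\mathcal{L}_{\text{global}} = \| R(\gamma) - R(\gamma^{\text{target}}) \|_F^2$. Here, $R(\gamma)$ is the rotation matrix corresponding to $\gamma$.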
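The sketch below converts axis-angle vectors to rotation matrices with Rodrigues' formula and compares them, taking the loss to be a squared Frobenius distance between the two matrices; that choice of norm is an assumption of this example. Function names are hypothetical.

```python
import numpy as np

def axis_angle_to_matrix(gamma):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = np.linalg.norm(gamma)
    if theta < 1e-8:
        return np.eye(3)
    k = gamma / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def global_rotation_loss(gamma_pred, gamma_target):
    """Squared Frobenius distance between the two rotation matrices (assumed form)."""
    diff = axis_angle_to_matrix(gamma_pred) - axis_angle_to_matrix(gamma_target)
    return np.sum(diff ** 2)
```

Comparing rotation matrices rather than raw axis-angle vectors avoids penalising vectors that differ by a full turn about the same axis, since these map to the same matrix.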
2D joints loss on samples. Applying the NLL loss alone results in overly uncertain predicted 3D shape and pose distributions (see Section 5.1). To ensure that the predicted distributions match the visual evidence in the input $X$, we impose a reprojection loss between ground-truth 2D joint coordinates (in the image plane) and predicted 2D joint samples, which are obtained by differentiably sampling 3D bodies from the predicted distributions and projecting to 2D using the predicted camera $c$. Ground-truth 2D joints are computed from the ground-truth shape and pose labels during synthetic training data generation (see Section 4).
We adapt an existing rejection sampler to sample from a matrix-Fisher distribution $\mathcal{M}(R; F)$, modifying it to allow for backpropagation of gradients through the proposal sampling step (lines 5-7 in Algorithm 1). We refer the reader to the original work for further details about the rejection sampler. In short, to simulate a matrix-Fisher distribution with parameter $F$, we sample unit quaternions from a Bingham distribution over the unit 3-sphere $S^3$, with Bingham parameter $A$ computed from $F$, and then convert the sampled quaternions into rotation matrices [20, 34] with the desired matrix-Fisher distribution. Rejection sampling is used to sample from the Bingham distribution, which has pdf $p_{\text{Bing}}(x) \propto \exp(-x^T A x)$ for $x \in S^3$. The proposal distribution for the rejection sampler is an angular central Gaussian (ACG) distribution, with pdf $p_{\text{ACG}}(x) \propto (x^T \Omega x)^{-2}$. The ACG distribution is easily simulated by sampling from a zero-mean Gaussian distribution with covariance matrix $\Omega^{-1}$ and normalising to unit-length (lines 5-7 in Algorithm 1). The re-parameterisation trick is used to differentiably sample from this zero-mean Gaussian, thus allowing for backpropagation of gradients through the rejection sampler.
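The Bingham rejection step with an ACG envelope can be sketched as follows. This is a simplified NumPy illustration rather than the paper's Algorithm 1: it assumes a symmetric positive semi-definite `A`, fixes the envelope parameter `b = 1` instead of tuning it for efficiency, uses $\Omega = I + 2A/b$, and omits both the computation of `A` from `F` and the quaternion-to-rotation conversion. The proposal is written as a fixed linear map of standard normal noise, which is the re-parameterised form through which gradients would flow in an autodiff framework.

```python
import numpy as np

def sample_bingham(A, n, rng, b=1.0):
    """Rejection-sample n unit 4-vectors with density proportional to exp(-x^T A x).

    Assumes A is symmetric positive semi-definite; b = 1 is a simple, valid but
    not efficiency-optimal envelope parameter (any 0 < b < 4 gives a valid bound).
    """
    q = 4
    omega = np.eye(q) + 2.0 * A / b
    L = np.linalg.cholesky(np.linalg.inv(omega))           # L @ L.T = omega^{-1}
    log_M = -(q - b) / 2.0 + (q / 2.0) * np.log(q / b)     # log of the envelope constant
    samples = []
    while len(samples) < n:
        eps = rng.standard_normal((n, q))
        y = eps @ L.T                                      # re-parameterised N(0, omega^{-1})
        x = y / np.linalg.norm(y, axis=1, keepdims=True)   # ACG proposal on the 3-sphere
        log_bing = -np.einsum("ni,ij,nj->n", x, A, x)
        log_acg = -(q / 2.0) * np.log(np.einsum("ni,ij,nj->n", x, omega, x))
        accept = np.log(rng.uniform(size=n)) < log_bing - log_M - log_acg
        samples.extend(x[accept])
    return np.array(samples[:n])
```

With `A = diag(0, 50, 50, 50)`, accepted samples cluster around the antipodal pair of unit vectors along the first axis, reflecting the antipodal symmetry of the Bingham distribution (and of unit quaternions as rotations).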
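Algorithm 1 samples $B$ sets of relative 3D joint rotation matrices $\{R_{i,b}\}_{i=1}^{23}$, for $b = 1, \dots, B$, from the corresponding distributions $\mathcal{M}(R_i; F_i)$. Furthermore, we differentiably sample $B$ SMPL shape vectors $\beta_b$ from the predicted Gaussian distribution $\mathcal{N}(\mu_\beta, \operatorname{diag}(\sigma^2_\beta))$, again using the re-parameterisation trick.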
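For the shape vectors, the re-parameterisation trick amounts to writing each sample as a deterministic function of the predicted mean and variance plus independent standard normal noise. The NumPy sketch below shows the form only (NumPy itself does not track gradients); the function name is hypothetical.

```python
import numpy as np

def sample_shapes(mu, var, num_samples, rng):
    """Draw shape samples as mu + sqrt(var) * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal((num_samples, mu.shape[0]))
    # The noise eps carries all the randomness, so in an autodiff framework
    # gradients w.r.t. mu and var pass straight through this expression.
    return mu + np.sqrt(var) * eps
```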
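The body shape and 3D joint rotation samples are converted into 2D joint samples using the SMPL model and weak-perspective camera parameters $c = [s, t_x, t_y]$:
$$J^{2D}_b = s\, \Pi\big(\mathcal{J}\, \mathcal{S}(\{R_{i,b}\}_{i=1}^{23}, \beta_b, \gamma)\big) + [t_x, t_y]$$
where $\mathcal{S}$ is the SMPL function, $\mathcal{J}$ is the vertex-to-joint regression matrix and $\Pi$ is an orthographic projection. The reprojection loss applied between the predicted 2D joint samples and the visible target 2D joint coordinates $J^{2D}_{\text{target}}$ is given by
$$\mathcal{L}_{\text{2D}} = \sum_{b=1}^{B} \big\| \omega \odot (J^{2D}_{\text{target}} - J^{2D}_b) \big\|_2^2$$
where the visibilities of the target joints are denoted by $\omega$ (1 if visible, 0 otherwise).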
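A small NumPy sketch of the projection and visibility-masked loss follows. It assumes the 3D joint samples have already been obtained from the SMPL model, that the translation is added after scaling, and that the loss is a sum of squared errors over visible joints and samples; these details and the function names are assumptions of the example.

```python
import numpy as np

def weak_perspective_project(joints3d, cam):
    """Orthographic projection (drop z), then scale and translate: s * xy + [tx, ty]."""
    s, tx, ty = cam
    return s * joints3d[..., :2] + np.array([tx, ty])

def reprojection_loss(joints3d_samples, cam, target2d, visibility):
    """Sum of squared 2D errors over B samples, counting visible joints only.

    joints3d_samples: (B, L, 3), target2d: (L, 2), visibility: (L,) of 0/1.
    """
    proj = weak_perspective_project(joints3d_samples, cam)   # (B, L, 2)
    sq_err = np.sum((proj - target2d) ** 2, axis=-1)         # (B, L)
    return np.sum(visibility * sq_err)
```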
4 Implementation Details
Synthetic training data. To train our 3D body shape and pose distribution prediction networks, we require a training dataset . We extend the synthetic training frameworks presented in [47, 48], which involve generating inputs and corresponding SMPL body shape and pose (i.e. 3D joint rotation) labels randomly and on-the-fly during training. In brief, for every training iteration, SMPL shapes are randomly sampled from a prior Gaussian distribution while relative 3D joint rotations and global rotation are chosen from the training sets of UP-3D , 3DPW  or Human3.6M . These are converted into training inputs and ground-truth 2D joint coordinates using the SMPL model and a light-weight renderer . Cropping, occlusion and noise augmentations are then applied to the synthetic inputs.
Previous synthetic training frameworks [47, 48, 53] often use silhouette-based training inputs. This necessitates accurate human silhouette segmentation at test-time, which may be challenging to do robustly. In contrast, our input representations consist of edge-images concatenated with 2D joint heatmaps. To generate edge-images, we first create synthetic RGB images by rendering textured SMPL meshes. For each training mesh, clothing textures are randomly chosen from [57, 2]. The textured SMPL mesh is rendered onto a background image (randomly chosen from LSUN ), using randomly-sampled lighting and camera parameters. Canny edge detection  is used to compute edge-images from the synthetic RGB images. We show in Section 5.1 that, despite the lack of photorealism in the synthetic RGB images, edge-filtering bridges the synthetic-to-real domain gap at test-time - and performs better than either silhouette-based or synthetic-RGB-based training inputs in our experiments. Examples of synthetic training samples are given in the supplementary material.
Table 1: Ablation study comparing input representations, network architectures and the 2D samples loss. MPJPE-SC and PVE-T-SC are in mm; 2D joint error (mode / samples) is in pixels.
|Input Type||Architecture||2D Samples Loss||Synthetic MPJPE-SC||Synthetic PVE-T-SC||Synthetic 2D Joint Err.||SSP-3D PVE-T-SC||SSP-3D 2D Joint Err.||3DPW MPJPE-SC|
|Silh. + J2DHmap||Independent||No||84.9||12.8||7.2 / 11.6||14.3||6.0 / 11.9||93.0|
|RGB + J2DHmap||Independent||No||79.9||11.3||7.1 / 11.7||14.0||5.9 / 12.0||92.8|
|Edge + J2DHmap||Independent||No||85.8||12.9||7.5 / 12.0||13.7||5.9 / 11.8||88.4|
|Edge + J2DHmap||Independent||Yes||86.3||13.2||7.6 / 8.9||13.9||6.2 / 9.6||91.3|
|Edge + J2DHmap||Hierarchical||No||84.4||12.8||7.3 / 10.4||13.6||5.3 / 11.2||87.7|
|Edge + J2DHmap||Hierarchical||Yes||79.1||12.6||6.7 / 6.9||13.6||4.8 / 6.9||84.7|
Training details. We use Adam with a learning rate of 0.0001, batch size of 80 and train for 150 epochs. For stability, the 2D joints reprojection loss is only applied on the mode pose and shape (projected to 2D) in the first 50 epochs and not on the samples, which are supervised in the next 100 epochs. To boost 3D pose metrics, an MSE loss on the mode 3D joint locations is applied in the final 50 epochs.
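The staged schedule can be written as a small lookup from epoch to active loss terms. The sketch assumes 0-indexed epochs, that the NLL and global rotation losses are active throughout, and that the reprojection loss switches from the mode to the samples after epoch 50; the loss names are labels invented for this example.

```python
def active_losses(epoch, total_epochs=150):
    """Loss terms active at a given 0-indexed epoch (assumed reading of the schedule)."""
    losses = ["nll", "global_rotation"]
    # First 50 epochs: reprojection loss on the mode only; afterwards on samples.
    losses.append("reprojection_mode" if epoch < 50 else "reprojection_samples")
    # Final 50 epochs: additional MSE loss on the mode 3D joint locations.
    if epoch >= total_epochs - 50:
        losses.append("mode_joints3d_mse")
    return losses
```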
Evaluation datasets. 3DPW  is used to evaluate 3D pose prediction accuracy. We report mean-per-joint-position-error after scale correction (MPJPE-SC)  and after Procrustes analysis (MPJPE-PA), both in mm. Both metrics are computed using the mode 3D joint coordinates of the predicted shape and pose distributions.
SSP-3D is primarily used to evaluate 3D body shape prediction accuracy, using per-vertex Euclidean error in a T-pose after scale-correction (PVE-T-SC)  in mm, computed with the mode 3D body shape from the predicted shape distribution. We also evaluate 2D joint prediction error (2D Joint Err. Mode/Samples) in pixels, computed using both the mode 3D body and 10 3D bodies randomly sampled from the predicted shape and pose distributions, projected onto the image plane using the camera prediction. 2D joint error is evaluated on visible target 2D joints only.
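As a rough guide to how a scale-corrected point error can be computed, the sketch below centres both point sets, rescales the prediction to the target's RMS spread and averages the per-point Euclidean distances. This particular definition of scale correction is an assumption of the example (the cited definition may differ in detail); the same helper applies to joints (MPJPE-SC) and to T-pose vertices (PVE-T-SC).

```python
import numpy as np

def scale_correct(pred, target):
    """Centre pred, match its RMS spread to target, and move it to target's centroid."""
    pred_c = pred - pred.mean(axis=0)
    target_c = target - target.mean(axis=0)
    scale = np.sqrt((target_c ** 2).sum() / (pred_c ** 2).sum())
    return pred_c * scale + target.mean(axis=0)

def mean_point_error_sc(pred, target):
    """Mean per-point Euclidean error after scale correction; inputs are (N, 3)."""
    return np.linalg.norm(scale_correct(pred, target) - target, axis=1).mean()
```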
Finally, we use a synthetic test dataset for our ablation studies investigating different input representations. It consists of 1000 synthetic input-label pairs, generated in the same way as the synthetic training data, with poses sampled from the test set of Human3.6M.
5 Experimental Results
This section investigates different input representations and the benefits of the 2D joints samples loss, compares independent and hierarchical distribution predictors and benchmarks our method against the state-of-the-art.
5.1 Ablation studies
Input proxy representation. Rows 1-3 in Table 1 compare different choices of input proxy representation: binary silhouettes, RGB images and edge-filtered images (each additionally concatenated with 2D joint heatmaps). The independent network architecture is used for all three input types. To investigate the synthetic-to-real domain gap, metrics are presented for synthetic test data, as well as real test images from SSP-3D and 3DPW. For the latter, silhouette segmentation is carried out with DensePose . Using RGB-based input representations (row 2) results in the best 3D shape and pose metrics on synthetic data, which is reasonable since RGB contains more information than both silhouettes and edge-filtered images. However, metrics are significantly worse on real datasets, suggesting that the network has over-fitted to unrealistic artefacts present in low-fidelity (i.e. computationally cheap) synthetic RGB images. Silhouette-based input representations (row 1) also demonstrate a deterioration of 3D metrics on real test data compared to synthetic data, since they are heavily reliant upon accurate silhouettes, which are difficult to robustly segment in test images containing challenging poses or severe occlusions. Inaccurate silhouette segmentations critically impair the network’s ability to predict 3D body pose and shape. In contrast, edge-filtering is a simpler and more robust operation than segmentation, but is still able to retain important shape information from the RGB image. Thus, edge-images (concatenated with 2D joint heatmaps) can better bridge the synthetic-to-real domain gap, resulting in improved metrics on real test inputs (row 3).
Hierarchical architecture and reprojection loss on 2D joints samples. Figure 3 and rows 3-6 in Table 1 compare the independent and hierarchical distribution prediction architectures presented in Section 3.4, both with and without the reprojection loss on sampled 2D joints (the 2D samples loss) from Section 3.5. When the 2D samples loss is not applied, the shape and pose distributions predicted by both the independent and hierarchical network architectures do not consistently match the input image, as evidenced by the significant gap between the visible 2D joint error computed using the distributions’ modes versus samples drawn from the distributions (in rows 3 and 5 of Table 1) on both synthetic test data and SSP-3D. This implies that the predicted distributions are overly uncertain about parts of the subject’s body that are visible and unambiguous in the input image. The visualisations corresponding to the hierarchical architecture trained without the 2D samples loss in Figure 3 (centre) further demonstrate that the predicted samples often do not match the input image, particularly at the extreme ends of the body. This results in significant undesirable per-vertex uncertainty over unambiguous body parts.
Applying the 2D samples loss to the independent network partially alleviates the mismatch between inputs and predicted samples, as shown by Figure 3 (right) and row 4 in Table 1, where the mode versus sample 2D joint error gap has reduced. However, training with the 2D samples loss deteriorates the independent architecture’s mode pose prediction metrics (MPJPE-SC and 2D Joint Err. Mode in row 3 vs 4 of Table 1) on both synthetic and real test data. This is because the independent architecture naively models each joint’s relative rotation independently of its parents’ rotations (Equation 5); however, to predict realistic human pose samples that match the visible input, each joint’s rotation distribution must be informed by its parents. The 2D samples loss attempts to force predicted samples to match the input despite this logical inconsistency, which causes a trade-off between mode and sample pose prediction metrics, particularly worsening MPJPE-SC.
In contrast, applying the 2D samples loss to the hierarchical network improves metrics corresponding to both mode and sample predictions, as shown by row 6 in Table 1. Now, each SMPL joint’s relative rotation distribution is conditioned on all its parents’ distributions (Equation 6). Thus, the NLL loss and the 2D samples loss work in conjunction in enabling predicted hierarchical distributions (and samples) to match the visible input, while yielding improved 3D metrics. Figure 3 (left) exhibits such visually-consistent samples and demonstrates greater prediction uncertainty for ambiguous parts. Note that uncertainty can arise even without occlusion in a monocular setting, e.g. due to depth ambiguities [50, 52] as shown by the left arm samples in the last row of Figure 3. Further visual results are in the supplementary material.
|HMR (unpaired) ||-||126.3||92.0|
|Ours w. Detectron2 ||96.2||84.7||59.2|
|Ours w. HRNet-W48 ||84.9||73.0||53.6|
|Max. input set size||Method||SSP-3D PVE-T-SC (mm)|
|5||HMR + Mean||22.9|
|5||GraphCMR + Mean||19.3|
|5||SPIN + Mean||21.9|
|5||DaNet + Mean||22.1|
|5||STRAPS + Mean||14.4|
|5||Sengupta + Mean||13.6|
|5||Sengupta + Prob. Comb.||13.3|
|5||Ours + Mean||12.2|
|5||Ours + Prob. Comb.||12.0|
5.2 Comparison with the state-of-the-art
Shape prediction. Table 3 evaluates 3D body shape metrics on SSP-3D  for single image inputs and multi-image input sets, which we evaluate using both mean and probabilistic combination methods from . Our network surpasses the state-of-the-art , mainly due to our use of an edge-based proxy representation, instead of the silhouette-based representations used in  and . These methods rely on accurate human silhouettes, which may be difficult to compute at test-time, as discussed in Section 5.1, while our method does not have such dependencies. However, our method may result in erroneous shape predictions when the subject is wearing loose clothing which obscures body shape, in which case the shape prediction over-estimates the subject’s true proportions (see rows 1-2 in Figure 3).
Pose prediction. Table 2 evaluates 3D pose metrics on 3DPW . Our method is competitive with the state-of-the-art and surpasses other methods that do not require 3D-labelled training images [47, 48, 28, 19]. Figure 4(a) shows that our method performs well for most test examples in 3DPW, even matching pose-focused approaches that do not attempt to accurately predict diverse body shapes [36, 25]. However, some images in 3DPW contain significant occlusion, which can lead to noisy 2D joint heatmaps in the proxy representations, resulting in poor 3D pose metrics as shown by the right end of the curve in Figure 4(a).
Further quantitative comparison with other shape and pose distribution/multi-hypothesis prediction approaches is given in the supplementary material.
6 Conclusion
In this paper, we have proposed a probabilistic approach to the ill-posed problem of monocular 3D human shape and pose estimation, motivated by the fact that multiple 3D bodies may explain a given 2D image. Our method predicts a novel hierarchical matrix-Fisher distribution over relative 3D joint rotations and a Gaussian distribution over SMPL  shape parameters, from which we can sample any number of plausible 3D reconstructions. To ensure that the predicted distributions match the input image, we have implemented a differentiable rejection sampler to impose a loss between predicted 2D joint samples and ground-truth 2D joint coordinates. Our method is competitive with the state-of-the-art in terms of pose metrics on 3DPW, while surpassing the state-of-the-art for shape accuracy on SSP-3D.
Acknowledgements. We thank Dr. Yu Chen (Metail), Mr. Jim Downing (Metail), Dr. David Bruner (SizeStream) and Dr. Delman Lee (TAL Apparel) for providing body shape evaluation data and supporting this research.
Supplementary Material: Hierarchical Kinematic Probability Distributions for 3D Human Shape and Pose Estimation from Images in the Wild
Section 7 in this supplementary material contains implementation details, particularly regarding synthetic training data generation and per-vertex uncertainty visualisation. Section 8 discusses qualitative results on the SSP-3D  and 3DPW  datasets, and compares distribution predictions on images with versus without artificial occlusions. Table 5 compares several recent multi-hypothesis 3D human shape and pose estimation approaches.
7 Implementation Details
7.1 Synthetic Training Data
Our shape and pose distribution prediction neural networks are trained using synthetic training data, consisting of edge-and-joint-heatmap inputs paired with ground truth SMPL  shape and pose parameters. Inputs are rendered on-the-fly during model training using randomly sampled camera extrinsics, lighting, backgrounds and clothing textures. Examples of synthetic training and validation data are given in Figure 5. Note how each body pose may be paired with a different body shape, clothing, camera and background, as well as occlusion and noise augmentations. Thus, we are able to render highly diverse training data on-the-fly during training, enabling the network to see a new pose/shape/clothing/camera/background combination in each training iteration.
Our synthetic RGB images (Figure 5) are computationally cheap but clearly far from photorealistic, resulting in a large synthetic-to-real domain gap. However, simple edge detection is able to significantly reduce this gap, motivating the use of edge-filtered images as part of our input proxy representation. We found that noisy edge detections (as seen in Figure 5) retained sufficient visual shape and pose information, and efforts to produce clean edge-images (e.g. hysteresis-based edge tracking or further hyperparameter tuning) did not improve performance.
The required body shape, pose, clothing and backgrounds are obtained as follows. For training, ground-truth SMPL 3D joint rotation matrices are sampled from the training splits of 3DPW and UP-3D, as well as Human3.6M subjects 1, 5, 6, 7 and 8, giving a total of 91106 training poses. Validation poses are sampled from the 3DPW/UP-3D validation splits and Human3.6M subjects 9 and 11, resulting in 33347 validation poses. SMPL body shape parameters are randomly sampled as $\beta_n \sim \mathcal{N}(0, 1.25^2)$ for each shape coefficient $n$ (mean and standard deviation as listed in Table 4). RGB clothing textures for the SMPL body mesh are selected from SURREAL and MultiGarmentNet, resulting in 917 training textures and 108 validation textures. Backgrounds are obtained from LSUN, which contains a collection of diverse indoor and outdoor scenes. We sample from 397582 different training backgrounds and 3000 different validation backgrounds. Note that background training images may contain other humans, which is intentional and essential for robustness against test images with multiple people. The network learns to focus on the person corresponding to the input joint heatmaps and ignore persons in the background.
Textured SMPL meshes are rendered with PyTorch3D, using a perspective camera model and Phong shading. Camera and lighting parameters are randomly sampled, with sampling hyperparameters given in Table 4. Generated images are cropped around the rendered body using a square bounding box, where the bounding box size is randomly scaled by a factor in the range (0.8, 1.2).
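The cropping step can be sketched as follows. This is a minimal sketch that assumes the body's extent is available as a binary silhouette mask and that regions outside the image are zero-padded. Both function names are hypothetical, and only the scale range is taken from the text.

```python
import numpy as np


def square_bbox_from_mask(mask, scale_range=(0.8, 1.2), rng=None):
    """Return (centre_row, centre_col, side) of a randomly scaled square box."""
    rng = np.random.default_rng() if rng is None else rng
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        raise ValueError("mask contains no foreground pixels")
    top, bottom = rows.min(), rows.max()
    left, right = cols.min(), cols.max()
    # A square box must cover the longer side of the tight bounding box.
    side = max(bottom - top + 1, right - left + 1)
    scale = rng.uniform(scale_range[0], scale_range[1])
    return (top + bottom) / 2.0, (left + right) / 2.0, side * scale


def crop_square(image, centre_row, centre_col, side):
    """Crop a square window, zero-padding where it leaves the image."""
    half = int(round(side / 2.0))
    size = 2 * half
    r0 = int(round(centre_row)) - half
    c0 = int(round(centre_col)) - half
    out = np.zeros((size, size) + image.shape[2:], dtype=image.dtype)
    # Copy only the part of the window that overlaps the image.
    rs, re = max(r0, 0), min(r0 + size, image.shape[0])
    cs, ce = max(c0, 0), min(c0 + size, image.shape[1])
    if rs < re and cs < ce:
        out[rs - r0:re - r0, cs - c0:ce - c0] = image[rs:re, cs:ce]
    return out
```

A scale factor below 1 crops into the body, while a factor above 1 leaves extra background around it, so the network sees subjects at varying apparent sizes.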
To further bridge the synthetic-to-real gap, we implement random occlusion, body part removal, 2D joint removal and 2D joint noise augmentations during training. Hyperparameters associated with data augmentations are given in Table 6.
Table 4: Hyperparameters for synthetic training data generation.

| Hyperparameter | Value |
|---|---|
| Shape parameter sampling mean | 0 |
| Shape parameter sampling std. | 1.25 |
| Cam. translation sampling mean | (0, -0.2, 2.5) m |
| Cam. translation sampling var. | (0.05, 0.05, 0.25) m |
| Cam. focal length | 300.0 |
| Lighting ambient intensity range | [0.4, 0.8] |
| Lighting diffuse intensity range | [0.4, 0.8] |
| Lighting specular intensity range | [0.0, 0.5] |
| Bounding box scale factor range | [0.8, 1.2] |
| Proxy representation dimensions | pixels |
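The sampling of per-example rendering parameters can be sketched as below, using the values listed in Table 4. This is a minimal sketch, not our implementation. It assumes that camera translation is drawn from an independent Gaussian per axis, that lighting intensities are drawn uniformly within their ranges, and that there are 10 shape coefficients (`num_betas`). The function and dictionary key names are hypothetical.

```python
import numpy as np

CAM_T_MEAN = np.array([0.0, -0.2, 2.5])   # metres, from Table 4
CAM_T_VAR = np.array([0.05, 0.05, 0.25])  # listed as a variance in Table 4


def sample_render_params(rng, num_betas=10):
    """Draw one set of shape, camera and lighting parameters (illustrative)."""
    # Shape coefficients: zero-mean Gaussian with std. 1.25.
    betas = rng.normal(0.0, 1.25, size=num_betas)
    # Camera translation: per-axis Gaussian (assumption), std = sqrt(variance).
    cam_t = rng.normal(CAM_T_MEAN, np.sqrt(CAM_T_VAR))
    # Lighting intensities: uniform within the stated ranges (assumption).
    lighting = {
        "ambient": rng.uniform(0.4, 0.8),
        "diffuse": rng.uniform(0.4, 0.8),
        "specular": rng.uniform(0.0, 0.5),
    }
    return {
        "betas": betas,
        "cam_t": cam_t,
        "focal_length": 300.0,
        "lighting": lighting,
    }
```

Calling `sample_render_params(np.random.default_rng())` once per training example gives a fresh shape/camera/lighting combination each iteration.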
7.2 Visualisation of Per-Vertex Uncertainty
Figures 6, 7 and 8 in this supplementary material, as well as several figures in the main manuscript, visualise per-vertex 3D location uncertainties corresponding to the predicted shape and 3D joint rotation distributions. These are computed by i) sampling 100 shape parameter vectors and relative 3D joint rotations (for the entire kinematic tree) from the predicted distributions, ii) passing each of these samples through the SMPL function to get the corresponding vertex meshes, iii) computing the mean location of each vertex over all the samples and iv) determining the average Euclidean distance from the sample mean for each vertex over all the samples, which is ultimately visualised in the vertex scatter plots as a measure of per-vertex 3D location uncertainty.
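Steps iii) and iv) above can be sketched in a few lines. The sketch assumes that steps i) and ii) have already been carried out, so that the sampled meshes are available as an array of shape `(num_samples, num_vertices, 3)`. The function name `per_vertex_uncertainty` is hypothetical.

```python
import numpy as np


def per_vertex_uncertainty(vertex_samples):
    """Mean vertex locations and per-vertex mean distance from that mean.

    vertex_samples: array of shape (num_samples, num_vertices, 3), i.e. the
    SMPL vertices obtained from each sampled shape/pose.
    """
    vertex_samples = np.asarray(vertex_samples, dtype=np.float64)
    # iii) Mean 3D location of each vertex over all samples: (V, 3).
    mean_vertices = vertex_samples.mean(axis=0)
    # iv) Euclidean distance of each sampled vertex from its mean: (N, V),
    #     then averaged over the samples: (V,).
    distances = np.linalg.norm(vertex_samples - mean_vertices, axis=-1)
    return mean_vertices, distances.mean(axis=0)
```

The returned per-vertex values are in the same units as the input vertices and can be mapped directly to a colour scale for the scatter plots.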
| Number of samples | 1 | 5 | 10 | 25 | 1 | 5 | 10 | 25 | 1 | 5 | 10 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ours (Independent) w. HRNet | 88.3 | 85.0 | 82.6 | 78.5 | 56.6 | 54.5 | 52.8 | 50.2 | 13.9 | 12.9 | 12.0 | 10.3 |
| Ours (Hierarchical) w. HRNet | 84.9 | 81.6 | 79.0 | 75.1 | 53.6 | 51.4 | 49.6 | 47.0 | 13.6 | 12.3 | 11.3 | 9.8 |
Table 6: Hyperparameters associated with data augmentations.

| Augmentation | Hyperparameter | Value |
|---|---|---|
| Body part occlusion | Occlusion probability | 0.1 |
| 2D joints L/R swap | Swap probability | 0.1 |
| Half-image occlusion | Occlusion probability | 0.05 |
| 2D joints removal | Removal probability | 0.1 |
| 2D joints noise | Noise range | [-8, 8] pixels |
| Occlusion box | Probability, Size | 0.5, 48 pixels |
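Three of these augmentations can be sketched as follows, using the probabilities and ranges listed above. This is an illustrative sketch under several assumptions: joint removal is applied independently per joint, joint noise is uniform within the stated range, and the occlusion box is a zero-filled square placed uniformly at random. The function names are hypothetical.

```python
import numpy as np


def augment_joints2d(joints2d, rng, remove_prob=0.1, noise_range=(-8.0, 8.0)):
    """Add uniform pixel noise and randomly mark joints as removed.

    joints2d: (J, 2) pixel coordinates. Returns the noisy joints and a
    boolean visibility mask of shape (J,), False for removed joints.
    """
    joints2d = np.asarray(joints2d, dtype=np.float64)
    noise = rng.uniform(noise_range[0], noise_range[1], size=joints2d.shape)
    visible = rng.random(joints2d.shape[0]) >= remove_prob
    return joints2d + noise, visible


def occlusion_box(image, rng, prob=0.5, size=48):
    """With probability `prob`, zero out a random size x size square."""
    out = image.copy()
    if rng.random() < prob:
        height, width = out.shape[:2]
        # Keep the box fully inside the image when the image is large enough.
        top = rng.integers(0, max(height - size, 0) + 1)
        left = rng.integers(0, max(width - size, 0) + 1)
        out[top:top + size, left:left + size] = 0
    return out
```

The visibility mask returned by `augment_joints2d` would typically be used to blank the heatmaps of removed joints before they are passed to the network.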
8 Qualitative Results
Figure 7 presents results on artificially occluded images from SSP-3D. In particular, note that i) occluded/invisible body parts result in increased 3D location uncertainty for corresponding vertices and ii) 3D body samples from the predicted distributions match the visible body parts in the 2D image, while invisible body part samples are more diverse. However, occluded sample diversity is still somewhat limited and samples tend to be clustered around the mode predictions, which is a weakness of our method. This may be alleviated by predicting multi-modal distributions over 3D shape and pose in future work. Figure 7 also illustrates our method’s ability to predict a range of body shapes, owing to the synthetic training framework used.
Figure 6 presents results on the test split of 3DPW. Again, note the increased uncertainty and sample diversity for occluded and out-of-frame body parts, and the reprojection consistency between predicted samples and the visible bodies in the images. Results on 3DPW highlight another key challenge for future work: when faced with baggy/loose clothing, our method tends to over-estimate the subject’s body proportions. This is because our synthetic training data does not model the shape of clothing on the human body surface, but only its texture. Future work could focus on using synthetic clothed humans for training.
Figure 8 compares shape and pose distribution predictions on images from SSP-3D with versus without artificial occlusions, further corroborating that ambiguous parts result in greater uncertainty and more diverse 3D samples. However, it is again apparent that sample diversity for highly ambiguous parts is more limited than expected, as samples tend to be closely clustered around the mode prediction.
-  (2005) SCAPE: shape completion and animation of people. In ACM Transactions on Graphics (TOG) - Proceedings of SIGGRAPH, Vol. 24, pp. 408–416. Cited by: §2, §2.
-  (2019-10) Multi-garment net: learning to dress 3D people from images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §4, §7.1.
-  (2020) 3D multibodies: fitting sets of plausible 3D models to ambiguous image data. In NeurIPS, Cited by: §2, Table 2, Table 5.
-  (1994) Mixture density networks. Technical report. Cited by: §2.
-  (2016-10) Keep it SMPL: automatic estimation of 3D human pose and shape from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.
-  (1986) A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 8 (6), pp. 679–698. Cited by: §3.3, §4, Figure 5, §7.1.
-  (2020) Real-time screen reading: reducing domain shift for one-shot learning. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: §1, §7.1.
-  (2020) Pose2Mesh: graph convolutional network for 3D human pose and mesh recovery from a 2D human pose. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2, Table 2.
-  (2001) People tracking using hybrid monte carlo filtering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Vol. 2, pp. 321–328. Cited by: §2.
-  (2000) Articulated body motion capture by annealed particle filtering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (1972-12) Orientation statistics. Biometrika 59 (3), pp. 665–676. Cited by: §1, §2, §3.2, §3.
-  (2020) Hierarchical kinematic human mesh recovery. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1, §2, §3.4.
-  (2020) Deep orientation uncertainty learning based on a bingham loss. In International Conference on Learning Representations, Cited by: §2.
-  (2019-06) HoloPose: holistic 3D human reconstruction in-the-wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
-  (2018) DensePose: dense human pose estimation in the wild. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §5.1, Table 6.
-  (2014-07) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 36 (7), pp. 1325–1339. Cited by: §2, §4, §4, §7.1.
-  (2017) Generating multiple diverse hypotheses for human 3D pose consistent with 2D joint detections. In IEEE International Conference on Computer Vision (ICCV) Workshops (PeopleCap), Cited by: §2.
-  (2018-06) Total capture: a 3D deformation model for tracking faces, hands, and bodies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §2.
-  (2018) End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3.4, §5.2, Table 2, Table 3.
-  (2013) A new method to simulate the Bingham and related distributions in directional data analysis with applications. Cited by: §3.5.
-  (1977) The Von Mises-Fisher matrix distribution in orientation statistics. Journal of the Royal Statistical Society. Series B (Methodological) 39 (1), pp. 95–106. Cited by: §1, §2, §3.2, §3.
-  (2014) Auto-encoding variational bayes. Cited by: §3.5, §3.5.
-  (2014) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.
-  (2020) PointRend: image segmentation as rendering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2019) Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2, §5.2, Table 2, Table 3.
-  (2019) Convolutional mesh regression for single-image human shape reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, Table 2, Table 3.
-  (2021) Probabilistic modeling for human mesh recovery. In ICCV, Cited by: Table 5.
-  (2020) Appearance consensus driven self-supervised human mesh recovery. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §5.2, Table 2.
-  (2017) Unite the People: closing the loop between 3D and 2D human representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §4, §7.1.
-  (2018) Bayesian attitude estimation with the matrix Fisher distribution on SO(3). IEEE Transactions on Automatic Control 63 (10), pp. 3377–3392. Cited by: §3.2, §3.2.
-  (2019-06) Generating multiple hypotheses for 3D human pose estimation with mixture density network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2021) HybrIK: a hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In CVPR, Cited by: Table 2.
-  (2015) SMPL: a skinned multi-person linear model. In ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH Asia, Vol. 34, pp. 248:1–248:16. Cited by: Figure 2, §1, §2, §2, §2, §3.1, §3, §6, §7.1, §7.2.
-  (2000) Directional statistics. Wiley. Cited by: §1, §2, §3.2, §3.5, §3.
-  (2020) Probabilistic orientation estimation with matrix fisher distributions. In Advances in Neural Information Processing Systems, Vol. 33. Cited by: §1, §2, §3.2, §3.5.
-  (2020) I2L-MeshNet: image-to-lixel prediction network for accurate 3D human pose and mesh estimation from a single rgb image. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1, §2, §5.2, Table 2.
-  (2020) GraphMDN: leveraging graph structure and deep learning to solve inverse problems. CoRR abs/2010.13668. Cited by: §2.
-  (2018) Neural body fitting: unifying deep learning and model-based human pose and shape estimation. In Proceedings of the International Conference on 3D Vision (3DV), Cited by: §1, §2.
-  (2019) Expressive body capture: 3D hands, face, and body from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §2.
-  (2019) TexturePose: supervising human mesh estimation with texture consistency. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.
-  (2018) Learning to estimate 3D human pose and shape from a single color image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, §2, §3.3.
-  (2018-09) Deep directional statistics: pose estimation with uncertainty quantification. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.
-  (2020) Accelerating 3D deep learning with PyTorch3D. arXiv:2007.08501. Cited by: §4, §7.1.
-  (2015) Variational inference with normalizing flows. In Proceedings of the International Conference on Machine Learning, F. Bach and D. Blei (Eds.), Proceedings of Machine Learning Research, Vol. 37, Lille, France, pp. 1530–1538. Cited by: §2.
-  (2019-10) PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
-  (2020-06) PIFuHD: multi-level pixel-aligned implicit function for high-resolution 3D human digitization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2020-09) Synthetic training for accurate 3D human pose and shape estimation in the wild. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: 3rd item, §1, §1, §2, §3.3, Table 1, §4, §4, §4, §4, §5.1, §5.2, §5.2, Table 2, Table 3, §6, §7.1, Figure 7, Figure 8, §8.
-  (2021) Probabilistic 3D human shape and pose estimation from multiple unconstrained images in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: 3rd item, §1, §2, §2, §3.4, §4, §4, §5.2, §5.2, Table 2, Table 3, Table 5.
-  (2005) Discriminative density propagation for 3D human motion estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 390–397. Cited by: §2.
-  (2001) Covariance scaled sampling for monocular 3D body tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §5.1, §8.
-  (2002) Hyperdynamics importance sampling. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.
-  (2003) Kinematic jump processes for monocular 3D human tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §5.1, §8.
-  (2019) Towards accurate 3D human body reconstruction from silhouettes. In Proceedings of the International Conference on 3D Vision (3DV), Cited by: §1, §1, §2, §4.
-  (2019) Deep high-resolution representation learning for human pose estimation. In CVPR, Cited by: §3.3, Table 2, Table 5.
-  (2017) Indirect deep structured learning for 3D human shape and pose prediction. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: §1, §2.
-  (2018) BodyNet: volumetric inference of 3D human body shapes. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1, §2.
-  (2017) Learning from synthetic humans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §4, §7.1.
-  (2018) Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2, Table 1, §4, §4, §5.2, Table 2, §6, Figure 6, §7.1, §8.
-  (2021) Probabilistic monocular 3D human pose estimation with normalizing flows. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
-  (2019) Detectron2. Note: https://github.com/facebookresearch/detectron2 Cited by: Table 2.
-  (2019) DenseRaC: joint 3D pose and shape estimation by dense render-and-compare. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
-  (2015) LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: §4, §7.1.
-  (2018) Monocular 3D pose and shape estimation of multiple people in natural scenes - the importance of multiple scene constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2020-06) 3D human mesh regression with dense correspondence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2019) DaNet: decompose-and-aggregate network for 3D human shape and pose estimation. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 935–944. Cited by: §1, §2, Table 2, Table 3.