
Collaborative Regression of Expressive Bodies using Moderation

by   Yao Feng, et al.
Max Planck Society

Recovering expressive humans from images is essential for understanding human behavior. Methods that estimate 3D bodies, faces, or hands have progressed significantly, yet separately. Face methods recover accurate 3D shape and geometric details, but need a tight crop and struggle with extreme views and low resolution. Whole-body methods are robust to a wide range of poses and resolutions, but provide only a rough 3D face shape without details like wrinkles. To get the best of both worlds, we introduce PIXIE, which produces animatable, whole-body 3D avatars from a single image, with realistic facial detail. To get accurate whole bodies, PIXIE uses two key observations. First, body parts are correlated, but existing work combines independent estimates from body, face, and hand experts, by trusting them equally. PIXIE introduces a novel moderator that merges the features of the experts, weighted by their confidence. Uniquely, part experts can contribute to the whole, using SMPL-X's shared shape space across all body parts. Second, human shape is highly correlated with gender, but existing work ignores this. We label training images as male, female, or non-binary, and train PIXIE to infer "gendered" 3D body shapes with a novel shape loss. In addition to 3D body pose and shape parameters, PIXIE estimates expression, illumination, albedo and 3D surface displacements for the face. Quantitative and qualitative evaluation shows that PIXIE estimates 3D humans with a more accurate whole-body shape and detailed face shape than the state of the art. Our models and code are available for research at





1 Introduction

To model human behavior, we need to capture how people look, how they feel, and how they interact with each other. To facilitate this, our goal is to reconstruct whole-body 3D shape and pose, facial expressions, and hand gestures from an RGB image. This is challenging, as humans vary in shape and appearance, they are highly articulated, they wear complex clothing, they are often occluded, and their face and hands are small, yet highly deformable. For these reasons, the community studies the body [15, 48, 53], hands [16, 31, 40, 108] and face [23] mostly separately.

Recent whole-body statistical models [47, 69, 102] provide the tools to address the problem holistically, by jointly capturing the body, face and hand surface. ExPose [20] reconstructs whole-bodied SMPL-X [69] meshes from an RGB image, using “expert” sub-networks for the body, face and hands. ExPose sometimes struggles because it uses totally independent experts, and combines their estimates naively by always fully trusting them. However, different images pose different challenges to each expert. Moreover, the shape of all body parts of a person is naturally correlated. Last, part experts operate at different spatial scales and have complementary strengths.

On one hand, face-only methods [24, 103] recover accurate facial shape, albedo and geometric details, which are important to reason about emotions. However, they need a tight crop around the face and struggle with extreme viewing angles, and faces that are small, low-resolution or occluded. On the other hand, whole-body methods [20, 47, 69, 76, 102] handle these challenges well, but estimate average-looking shapes, and cannot estimate facial albedo and fine details.

To get the best of all worlds, we introduce PIXIE (“Pixels to Individuals: eXpressive Image-based Estimation”). PIXIE estimates expressive whole-bodied 3D humans from an RGB image with a much more realistic shape and pose than existing work, and a significantly more detailed face. For this, it pushes the state of the art in a threefold way.

First, PIXIE learns not only experts for the body, face and hands, but also a novel moderator that estimates their confidence for each image, and fuses their features weighted by this. The learned fusion helps improve whole-body shape, using SMPL-X’s shared shape space across all body parts. Moreover, it helps to robustly estimate head and hand pose under strong ambiguities (e.g. occlusions or blur) using the context of the full body; see Fig. 2 for examples.

Second, PIXIE significantly improves “gendered” body shape realism. Most body regression methods estimate a highly inaccurate body shape – often with the wrong gender or with a gender-neutral shape. While human shape is highly correlated with gender, existing methods ignore this. The only exception is SMPLify-X, but it uses an offline gender classifier and fits a gender-specific SMPL-X model. Instead, using a single unisex SMPL-X model enables end-to-end training of neural nets. PIXIE adopts this approach, and learns to implicitly reason about shape. To enable this, we define male, female, and non-binary priors on body shape within the SMPL-X shape space. At training time, given automatically created gender labels for input images, we train PIXIE to output plausible shape parameters for the gender. At inference time, PIXIE needs no labels and is applicable to any in-the-wild image. Note that this approach is general and is relevant for the broader community (face, hands, body, whole-body). Body shape is also highly correlated with face shape [30, 37, 52]. We do the same gendered training for our face expert; consequently, PIXIE can – uniquely – use face shape to improve body shape. This training and network architecture significantly improves body shape both qualitatively and quantitatively.

Figure 2: PIXIE infers the confidence of its body, face and hand experts, and fuses their features accordingly. Challenges, like occlusions, are resolved with full-body context. (L) Input image. (R) Color-coded part-expert confidence.

Third, PIXIE’s face expert additionally infers facial albedo and dense 3D facial-surface displacements. For this, we draw inspiration from Feng et al. [24], and go beyond them in three ways: (1) We use a whole-body shape space, rather than a face-only space, to capture correlations between the body and face shape. (2) We use a photo-metric and identity loss on faces to inform whole-body shape. (3) We use the inferred geometric details only when the face expert is confident, as judged by our moderator. As shown in Fig. 1, this results in whole-body 3D humans with previously unseen detailed faces, which can be fully animated.

To summarize, here we make three key contributions: (1) We train a novel moderator, that infers the confidence of body-part experts and fuses their features weighted by this. This improves shape and pose inference under ambiguities. (2) We train the network to implicitly reason about gender, i.e. without gender labels at test time, with a novel “gendered” 3D shape loss that encourages likely body shapes. (3) We extend our face expert with branches that estimate facial albedo and 3D facial-surface displacements, enabling whole-body animation with a realistic face.

PIXIE makes a significant step towards automatic, accurate and realistic 3D avatar creation from a single RGB image. Our models and code are available for research at

2 Related work

Body reconstruction: For years, the community focused on the prediction of 2D or 3D landmarks for the body [19], face [17] and hands [83, 99], with a recent shift towards estimating 3D model parameters [15, 48, 67, 71] or 3D surfaces [54, 79, 80, 96]. One line of work simplifies the problem by using proxy representations like 2D joints [15, 34, 35, 42, 64, 71, 82, 93, 109], silhouettes [8, 42, 71], part labels [67, 78] or dense correspondences [75, 105]. These are then “lifted” to 3D, either as part of an energy term that is minimized [15, 42, 104] or using a regressor [64, 67, 71, 93]. To overcome ambiguities, they use priors such as known limb lengths [58], limits on joint angles [9], or a statistical body model [15, 42, 67, 69, 71] like SMPL [61] or SMPL-X [69]. While these approaches benefit from 2D annotations, they cannot overcome errors in the proxy features and do not fully exploit image context. The alternative is to directly regress from the image pixels. Such methods estimate 3D skeletons [59, 70, 86, 87, 89], statistical model parameters [20, 27, 48, 49, 53] or non-parametric representations, like 3D meshes [54], depth maps [29, 84], 3D voxels [96, 110] or 3D distance fields [79, 80].

Face reconstruction: Most modern monocular 3D face reconstruction methods estimate the parameters of a pre-computed statistical face model [23]. Similar to the body literature, this problem is tackled with both optimization [10, 14, 97, 92] and regression methods [25, 44, 81, 90]. Many learning-based approaches follow an analysis-by-synthesis strategy [22, 90, 91], which jointly estimates geometry, albedo, and lighting, to render a synthetic image [62, 72], which is compared with the input. Recent work [24, 22, 33] further employs face-recognition terms [18] during training to reconstruct more accurate facial geometry. Even geometric details, such as wrinkles, can be learned from big collections of in-the-wild images [24, 94]. We refer to Egger et al. [23] for a comprehensive overview. The major downsides of face-specific approaches are their need for tightly cropped face images and their inability to handle non-frontal images. The latter is mainly due to the lack of supervision; 2D landmarks may be missing, in which case the photometric term is not applicable. By integrating face and body regression, PIXIE regresses head pose and shape robustly in situations where face-only methods fail. Further, the integration means that body shape estimation benefits from the face, which is correlated with body shape.

Hand reconstruction: While hand pose estimation is most often performed from RGB-D data, there has been a recent shift towards the use of monocular RGB images [13, 16, 39, 40, 43, 56, 65, 88, 113]. Similar to the body, we split these into methods that predict 3D joints [43, 65, 88, 113], parameters of a statistical hand model [13, 16, 40, 56, 108], such as MANO [74], or a 3D surface [31, 55].

Whole-body reconstruction: Recent methods approach the problem of human reconstruction holistically. Some of these estimate 3D landmarks for the body, face and hands [45, 100], but not their 3D surface. This is addressed by whole-body statistical models [47, 69, 102], that jointly capture the 3D surface for the body, face and hands.

SMPLify-X [69] fits SMPL-X [69] to 2D body, hand and face keypoints [19] estimated in an image. Xiang et al. [101] estimate both 2D keypoints and a part orientation field and fit Adam [47] to these. Xu et al. [102] fit GHUM [102] to detected body-part image regions. While these methods work, they are based on optimization; consequently, they are slow and do not scale to large datasets.

Deep-learning methods [20, 76] tackle these limitations, and quickly regress SMPL-X parameters from an image. ExPose [20] uses “expert” sub-networks for the body, face and hands, and merges their output by always trusting them. To account for the different sizes of these body parts, it uses body-driven attention, and multiple data sources for both part-only and whole-body supervision. FrankMocap [76] takes a similar approach, but uses no attention and does not estimate the face. FrankMocap includes an optional optimization step to improve the alignment with the image. Zhou et al. [112] train a network to regress SMPL+H [74] and MoFA [91] from an RGB image, following a body-part attention mechanism and multi-source training similar to ExPose. We draw inspiration from these methods and go further: (1) PIXIE assesses the confidence of its experts, and fuses their features weighted by this, to robustify body inference. (2) PIXIE improves shape realism, by learning to implicitly reason about gender, to estimate a suitable shape for this. (3) PIXIE adds albedo, predicted by its face expert, and geometric details to the SMPL-X face, using the state-of-the-art module of Feng et al. [24] on confident face predictions.

Concurrently to us, Zhou et al. [112] (arXiv) have targeted similar goals. In this direction, they estimate configurations of the body-and-hands SMPL+H model [74] and the MoFA [91] face model. Note that these are disparate models, which are (offline) manually cut-and-stitched together. Instead, we use the whole-body SMPL-X model [69] that elegantly captures all body parts together, thus no stitching is required. Zhou et al. fuse only hand-body features in a “binary” fashion, while their face model is “disconnected” from the body. Instead, we fuse both face-body and hand-body features in a “fully analog” fusion, and thus our face expert can – uniquely – inform whole-body shape. Zhou et al. have no face camera, and need PnP-RANSAC [28] and PA to align their face to the image. Instead, we infer a face-specific camera and need no extra steps. Zhou et al. use a complicated architecture, with several modules that can only be trained separately, which is applicable only to whole bodies. Instead, we use no intermediate tasks to avoid more sources of error, and train our model end to end. Our full model is applicable to whole bodies, but our part experts are also (separately) applicable to part-only data.

3 Method

Here we introduce PIXIE, a novel model for reconstructing SMPL-X [69] humans with a more realistic face from a single RGB image. It uses a set of expert sub-networks for body, face/head, and hand regression, and combines them in a bigger network architecture with three main novelties: (1) We use a novel moderator that assesses the confidence of part experts and fuses their features weighted by this, for robust inference under ambiguities, like strong occlusions. (2) We use a novel “gendered” shape loss, to improve body shape realism by learning to implicitly reason about gender. (3) In addition to the albedo predicted by our face expert, we employ the surface details branch of Feng et al. [24].

3.1 Expressive 3D Body Model

We use the expressive SMPL-X [69] body model that captures whole-body pose and shape, including facial expressions and finger articulation. It is a differentiable function, parameterized by shape, pose and expression, that produces a 3D mesh. The shape parameters are coefficients of a linear shape space, learned from CAESAR [73] registered scans. This is a joint shape space for the body, face, and hands, naturally capturing their shape correlations. The expression parameters are also coefficients of a low-dimensional linear space. The overall pose parameters consist of body, jaw and hand pose vectors. Each joint rotation is encoded as a 6D vector [111], except for the jaw, which uses Euler angles, i.e. a 3D vector. We follow the notation of [48] for the posed joints.
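As a rough illustration of the 6D rotation encoding, the sketch below maps a 6D vector to a rotation matrix with Gram-Schmidt orthogonalization. It assumes that [111] refers to the continuous 6D representation built from two 3D vectors; the function name `rotation_from_6d` and the column ordering are illustrative choices, not taken from PIXIE's code.

```python
import numpy as np

def rotation_from_6d(x6):
    """Map a 6D vector (two stacked 3D vectors) to a 3x3 rotation matrix."""
    x6 = np.asarray(x6, dtype=float)
    a1, a2 = x6[:3], x6[3:]
    # First basis vector: normalize the first 3D vector.
    b1 = a1 / np.linalg.norm(a1)
    # Second basis vector: remove the component along b1, then normalize.
    a2_orth = a2 - np.dot(b1, a2) * b1
    b2 = a2_orth / np.linalg.norm(a2_orth)
    # Third basis vector: the cross product completes a right-handed frame.
    b3 = np.cross(b1, b2)
    # Assumption: the basis vectors are stored as matrix columns.
    return np.stack([b1, b2, b3], axis=1)

# Hypothetical, unnormalized network output for one joint.
R = rotation_from_6d([1.0, 0.2, 0.0, 0.1, 1.0, 0.3])
```

Any 6D input with two linearly independent halves yields a valid rotation, which is why such an encoding is convenient as a regression target.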

Camera: To reconstruct SMPL-X from images, we use the weak-perspective camera model with scale and translation parameters, and use it to project the model joints and vertices onto the image.
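A minimal sketch of a weak-perspective projection is given below. The convention used here (drop depth, add a 2D translation, then multiply by a scalar scale) is one common choice and is an assumption; the source text does not state which convention PIXIE follows, and the names are hypothetical.

```python
import numpy as np

def weak_perspective_project(points_3d, scale, translation):
    """Project N x 3 points to N x 2 image coordinates (assumed convention)."""
    points_3d = np.asarray(points_3d, dtype=float)
    translation = np.asarray(translation, dtype=float)
    # Orthographic step: keep x and y, ignore depth.
    xy = points_3d[:, :2]
    # Assumed order of operations: translate first, then scale.
    return scale * (xy + translation)

joints = np.array([[0.0, 0.0, 2.0], [0.5, -0.25, 3.0]])
projected = weak_perspective_project(joints, scale=2.0, translation=[0.1, 0.0])
```

Note that the depth of each point has no effect on its projection, which is what distinguishes this model from a full perspective camera.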

Figure 3: Body, face/head and hand image crops are fed to the expert encoders to produce part-specific features. Our novel moderators estimate the confidence of the experts for these images, and fuse face-body and hand-body features weighted by this, to create fused features. These are fed to the regressors for robust regression. The facial branch [24] estimates fine geometric details. Icon from Freepik.

3.2 PIXIE Architecture

PIXIE uses the architecture of Fig. 3, and is trained end to end. All model components are described below.

Input images: Given a full-resolution image, we assume a bounding box around the body. We use this to crop and downsample the body to feed our network. However, this makes hands and faces too low resolution. We thus use an attention mechanism [20] to extract high-resolution crops for the face/head and the hand from the full-resolution image.

Feature encoding: We feed the body, face/head and hand crops to separate expert encoders to extract part-specific features. We use a ResNet-50 [41] for the face/head and hand experts to generate their features. The body expert uses an HRNet [85], followed by convolutional layers that aggregate the multi-scale feature maps, to generate the body features.

Feature fusion (moderator): We identify the expert pairs of {body, head} and {body, hand} as complementary, and learn the novel moderators that build “fused” features and feed them to face/head and hand regressors (described below) for more informed inference. A moderator is implemented as a multi-layer perceptron (MLP) and gets the body and part (face/head or hand) features and fuses them with a weighted sum:


Here, each part (face/head or hand) has its own moderator, which estimates the confidence of the part expert. The body feature is first transformed by the respective “extractor”, i.e. the linear layer between the body encoder and the part moderator. Finally, a temperature weight is learned jointly with all network weights using the losses of Sec. 3.3, with no dedicated supervision.
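The fusion step can be sketched as follows. This is an illustration under stated assumptions, not PIXIE's implementation: the confidence is assumed to be a sigmoid of the moderator output scaled by the temperature, the confidence is assumed to weight the part feature (with the remainder going to the transformed body feature), and all names, sizes and random weights are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_mlp(in_dim, hidden_dim, rng):
    """Return a tiny random two-layer MLP mapping a vector to one scalar."""
    w1 = rng.normal(scale=0.1, size=(hidden_dim, in_dim))
    w2 = rng.normal(scale=0.1, size=hidden_dim)

    def mlp(x):
        hidden = np.maximum(w1 @ x, 0.0)  # ReLU activation
        return float(w2 @ hidden)

    return mlp

def fuse_features(body_feat, part_feat, extractor, moderator, temperature):
    """Fuse body and part features with a confidence-weighted sum."""
    # Linear "extractor" between the body encoder and the part moderator.
    body_for_part = extractor @ body_feat
    # The moderator sees both features and outputs one scalar.
    logit = moderator(np.concatenate([body_for_part, part_feat]))
    # Assumption: confidence is a temperature-scaled sigmoid in (0, 1).
    confidence = sigmoid(temperature * logit)
    fused = confidence * part_feat + (1.0 - confidence) * body_for_part
    return fused, confidence

rng = np.random.default_rng(0)
dim = 8  # hypothetical feature size, far smaller than a real encoder output
body_feat = rng.normal(size=dim)
hand_feat = rng.normal(size=dim)
extractor = rng.normal(scale=0.1, size=(dim, dim))
moderator = make_mlp(2 * dim, 16, rng)
fused, confidence = fuse_features(body_feat, hand_feat, extractor,
                                  moderator, temperature=1.0)
```

In this sketch, a confidence close to 1 reproduces the part expert's feature, while a confidence close to 0 falls back on the body context, which is the behavior wanted under occlusion or blur.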

Parameter regression: We use two main regressor types: (1) The body, face/head, and hand regressors get features only from the respective expert encoder. The body regressor infers the body camera, and the body rotation and pose up to (excluding) the head and wrist. The face/head regressor infers the face camera, face albedo, and lighting. The hand regressor infers the hand camera. (2) The second face/head and hand regressors get the “fused” features from the moderators. The fused hand regressor infers the wrist and finger pose. The fused face/head regressor infers expressions, head rotation, and jaw pose. Importantly, it also infers body shape, letting our face expert contribute to whole-body shape.

Detail capture: We use the fine geometric details branch of Feng et al. [24] that, given a face image, estimates dense 3D displacements on top of FLAME’s [60] surface. We convert the displacements from FLAME’s to SMPL-X’s UV map, and apply them on PIXIE’s inferred head shape. However, inferring geometric details from full-body images is not trivial; faces tend to be much noisier in these than in face-only images. We account for this with our moderator, and use the inferred displacements only when the face/head expert is confident, i.e. the face image is not too noisy.

3.3 Training Losses

To train PIXIE we use body, hand and face losses:


These are defined as follows; a hat over a symbol denotes ground truth.

Body losses: Following [20], we use a combination of a 2D re-projection, a 3D joint and a SMPL-X parameter loss:


Hand losses: We employ a similar set of losses to train the 3D hand pose and shape estimation network:


These are defined similarly to the corresponding body losses, but using the hand joints and hand pose parameters.

Face losses: For the head training data, we follow standard practices established in the 3D face estimation community:


The landmark loss measures the difference between ground-truth 2D landmarks and the respective model landmarks (originally on the 3D mesh) projected on the image plane:


We also compute a loss for the set of landmarks on the upper eyelid, lower eyelid, upper lip, and lower lip:


The face parameter loss follows the body parameter loss, but for face pose only. This loss is only used for face crops from body data, when the target face pose is available.

Given the predicted 3D face mesh (a subset of the whole-body mesh), the face albedo and the lighting, we render a synthetic image of the input subject using the differentiable renderer from PyTorch3D [72]. We then minimize the difference between the input face image and the rendered image:


where S is a binary face mask with value 1 in the face skin region and 0 elsewhere, and ⊙ denotes the Hadamard product. The segmentation mask prevents errors from non-face regions influencing the optimization, and we use the segmentation network of Nirkin et al. [66] to extract S. The image formation process is the same as in Feng et al. [24].
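The masked photometric term can be sketched as below. The choice of a mean absolute difference and the normalization over all pixels are assumptions made for illustration; the source only states that the difference is taken inside the skin mask via a Hadamard product.

```python
import numpy as np

def photometric_loss(image, rendered, skin_mask):
    """Masked image difference for H x W x 3 images and an H x W binary mask."""
    # Broadcast the mask over the color channels (element-wise product).
    masked_diff = skin_mask[..., None] * (image - rendered)
    # Assumption: average the absolute error over all pixels and channels.
    return float(np.abs(masked_diff).mean())

# Toy example: the left half of a 4 x 4 image counts as face skin.
image = np.ones((4, 4, 3))
rendered = np.zeros((4, 4, 3))
skin_mask = np.zeros((4, 4))
skin_mask[:, :2] = 1.0
loss = photometric_loss(image, rendered, skin_mask)
```

Pixels outside the mask contribute nothing, so hair, background or occluders cannot pull the albedo and lighting estimates away from the skin region.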

Following [22, 32], we use a pre-trained face recognition network [18] to compute embeddings for the rendered image and the input image. We encourage the network to produce a synthetic face image with an identity similar to the ground-truth:
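One plausible form of such an identity term is one minus the cosine similarity between the two embeddings. This form is an assumption; the embeddings below are hypothetical stand-ins for the outputs of the face recognition network.

```python
import numpy as np

def identity_loss(emb_input, emb_rendered):
    """One minus the cosine similarity of two embedding vectors."""
    a = np.asarray(emb_input, dtype=float)
    b = np.asarray(emb_rendered, dtype=float)
    cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - cosine)

# Hypothetical embeddings: same direction, different magnitude.
same_identity = identity_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
different_identity = identity_loss([1.0, 0.0], [0.0, 1.0])
```

Because the cosine ignores vector length, only the direction of the embedding, i.e. the identity it encodes, affects the loss.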


Priors: Due to the difficulty of the problem, we use additional priors to constrain PIXIE to generate plausible solutions. For expression parameters, we use a Gaussian prior:


We also add soft regularization on jaw and face pose:


All these priors are “standard” regularizers, empirically found to discourage implausible configurations (extreme values, unrealistic shape/pose, inter-penetrations, etc).

Gender: As gender strongly affects body shape, we use a gender-specific shape prior during training, when gender labels are available. For this, we register SMPL-X to CAESAR [73] scans, and compute the mean and covariance of shape parameters for each gender. We then use:


When gender is unknown, we use a Gaussian prior computed over all scans/registrations, irrespective of gender. Please note that we do not need gender labels for inference.
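A sketch of a gender-conditioned Gaussian shape prior is shown below. Using the squared Mahalanobis distance is an assumption that follows from the per-gender mean and covariance described above; the two-dimensional shape vectors and all prior values are hypothetical placeholders.

```python
import numpy as np

def shape_prior_loss(betas, gender, priors):
    """Squared Mahalanobis distance of `betas` to the selected shape prior."""
    # Fall back to the prior over all scans when the label is unknown.
    mean, cov = priors.get(gender, priors["all"])
    diff = np.asarray(betas, dtype=float) - mean
    # Solve cov @ x = diff instead of forming the inverse explicitly.
    return float(diff @ np.linalg.solve(cov, diff))

# Hypothetical priors; real ones would come from registered body scans.
priors = {
    "male": (np.array([0.5, 0.0]), np.eye(2)),
    "female": (np.array([-0.5, 0.0]), np.eye(2)),
    "non-binary": (np.array([0.0, 0.0]), np.eye(2)),
    "all": (np.zeros(2), 2.0 * np.eye(2)),
}

labeled = shape_prior_loss([1.5, 0.0], "male", priors)
unlabeled = shape_prior_loss([1.0, 0.0], None, priors)
```

The label only selects which prior is applied during training, so nothing in this term needs to be evaluated at inference time.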

Feature update loss: We encourage the transformed body features (for the face/head or the hand) to match the corresponding part-expert features, with a loss that was empirically found to stabilize network training:


3.4 Implementation Details

Training data: For whole-body data we use the curated SMPL-X fits of [20], and SMPL-X fits to whole-body COCO data [45]. For hand-only data we use FreiHAND [114] and Total Motion [101]. For face/head data we use VGGFace2 [18] and detect 2D landmarks with the method of Bulat et al. [17]. We get gender annotations by running the method of Rothe et al. [77] on many photos per identity and using majority voting, to improve robustness. For data augmentation, see Sup. Mat.
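The majority-voting step for gender annotations can be sketched as follows. The per-photo labels are assumed to come from an external classifier, and the function name is hypothetical.

```python
from collections import Counter

def majority_label(per_photo_labels):
    """Return the most frequent label among the photos of one identity."""
    counts = Counter(per_photo_labels)
    # On ties, most_common keeps the label that was encountered first.
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical classifier outputs for several photos of the same person.
votes = ["female", "male", "female", "female", "male"]
identity_label = majority_label(votes)
```

Aggregating over many photos makes the label robust to occasional misclassifications on individual images.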

Network training: We do multi-step training which empirically aids stability. We pre-train on part-only data, and train on whole-body data end to end; for details see Sup. Mat.

4 Experiments

| Method | Type | Body model | Time (s) | PA-V2V All | PA-V2V Body | PA-V2V L/R hand | PA-V2V Face | TR-V2V All | TR-V2V Body | TR-V2V L/R hand | TR-V2V Face | PA-MPJPE MPJPE-14 | PA-MPJPE L/R hand | PA-P2S Mean | PA-P2S Median |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SMPLify-X† [69] | O | SMPL-X | 40-60 | 52.9 | 56.37 | 11.4/12.6 | 4.4 | 79.5 | 92.3 | 21.3/22.1 | 10.9 | 73.5 | 12.9/13.2 | 28.9 | 18.1 |
| SMPLify-X [69] | O | SMPL-X | 40-60 | 65.3 | 75.4 | 11.6/12.9 | 4.9 | 93.0 | 116.1 | 23.8/24.9 | 11.5 | 87.6 | 12.2/13.5 | 36.8 | 23.0 |
| MTC [101] | O | Adam | 20 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | 107.8 | 16.3/17.0 | 41.3 | 29.0 |
| SPIN [53] | R | SMPL | 0.01 | N/A | 60.6 | N/A | N/A | N/A | 96.8 | N/A | N/A | 102.9 | N/A | 40.8 | 28.7 |
| FrankMocap [76] | R | SMPL-X | 0.08 | 57.5 | 52.7 | 12.8/12.4 | N/A | 76.9 | 80.1 | 32.1/31.9 | N/A | 62.3 | 13.2/12.6 | 31.6 | 19.2 |
| ExPose [20] | R | SMPL-X | 0.16 | 54.5 | 52.6 | 13.1/12.5 | 4.8 | 65.7 | 76.8 | 31.2/32.4 | 15.9 | 62.8 | 13.5/12.7 | 28.9 | 18.0 |
| PIXIE (ours) | R | SMPL-X | 0.08-0.10 | 55.0 | 53.0 | 11.2/11.0 | 4.6 | 67.6 | 75.8 | 25.6/27.0 | 14.2 | 61.5 | 11.7/11.4 | 29.9 | 18.4 |

Table 1: Evaluation on EHF [69]. All errors are in mm. PIXIE is on par with the state of the art w.r.t. body performance, but predicts better face shapes and hand poses. † uses the ground-truth focal length (excluded from bold). Run-times were measured on an Intel Xeon W-2123 3.60GHz machine with an NVIDIA Quadro P5000 GPU. “O/R” denotes Optimization/Regression.

4.1 Evaluation Datasets

EHF [69]: We evaluate whole-body accuracy on this. It has RGB images of a minimally-clothed subject in lab settings, with ground-truth SMPL-X meshes and 3D scans.

AGORA [11]: We evaluate whole-body accuracy on this. It has rendered [6] photo-realistic images of 3D human scans [1, 2, 4, 5] in scenes [7, 3]. It has SMPL-X ground truth recovered from scans, images and semantic labels [106].

3DPW [98]: We evaluate main-body accuracy on this. It captures subjects in indoor/outdoor videos with SMPL pseudo ground truth, recovered from images and IMUs.

NoW [81]: We use it to evaluate face/head-only accuracy. It contains a 3D head scan for subjects, and images with various viewing angles and facial expressions.

FreiHAND [114]: We evaluate hand-only accuracy on this. It has hand/hand-object images of subjects, with MANO ground truth, recovered from multi-view images.

4.2 Evaluation Metrics

Mesh alignment: Prior to computing a metric, we align estimated meshes to ground-truth ones. The prefix “PA” denotes Procrustes Alignment (solving for scale, rotation and translation), while “TR” denotes translation alignment. “TR” is stricter, as it does not factor out scale and rotation. For hand-/face-only metrics, we align each part separately.
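The two alignment modes can be sketched as follows. The Procrustes variant uses the standard SVD-based solution for the optimal similarity transform; for the “TR” variant, matching the centroids of all points is an assumption, since the text does not say which reference point is used.

```python
import numpy as np

def procrustes_align(source, target):
    """Align N x 3 `source` to `target` with scale, rotation and translation."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    src_c, tgt_c = source - mu_s, target - mu_t
    u, sing, vt = np.linalg.svd(src_c.T @ tgt_c)
    # Flip the last axis if needed, so the result is a rotation (no reflection).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    fix = np.array([1.0, 1.0, d])
    rotation = vt.T @ np.diag(fix) @ u.T
    scale = (sing * fix).sum() / (src_c ** 2).sum()
    return scale * src_c @ rotation.T + mu_t

def translation_align(source, target):
    """Align by translation only (assumption: match the point centroids)."""
    return source - source.mean(axis=0) + target.mean(axis=0)

# Toy check: the target is a scaled, rotated and shifted copy of the source.
rng = np.random.default_rng(0)
source = rng.normal(size=(10, 3))
angle = 0.3
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle), np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
target = 1.7 * source @ rot_z.T + np.array([0.5, -1.0, 2.0])
aligned = procrustes_align(source, target)
```

Since Procrustes alignment removes global scale and rotation errors, a method can score well on “PA” metrics while still being visibly misoriented, which is why the “TR” variants are described as stricter.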

Mean Per-Joint Position Error (MPJPE): We report the mean Euclidean distance between the estimated and ground-truth joints. For the body-only metric, we compute the LSP-common joints [46] as a common skeleton across different body models, using a linear joint regressor [15, 57] on the estimated and ground-truth vertices. This is a standard metric, but is too sparse; it cannot capture errors in full 3D shape (i.e. surface), or all limb rotation errors.

Vertex-to-Vertex (V2V): For methods that infer meshes with the same topology as the ground-truth ones, e.g. SMPL(-X) estimations and SMPL(-X) ground truth, we compute the mean per-vertex error by taking into account all vertices. This is not possible for methods with different topology, e.g. SMPL estimations for SMPL-X ground truth, and vice versa. For such cases, we compute a main-body variant of V2V, as SMPL and SMPL-X share the same topology for the main body, excluding the hands and head. V2V is stricter than MPJPE; it also captures 3D shape errors and unnatural limb rotations (for the same joint positions).
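Both MPJPE and V2V reduce to a mean Euclidean distance over corresponding points (joints or vertices) after alignment. A minimal sketch, assuming that the inputs are given in meters and reported in mm:

```python
import numpy as np

def mean_point_error_mm(estimated, ground_truth, unit_scale=1000.0):
    """Mean Euclidean distance between corresponding N x 3 points."""
    est = np.asarray(estimated, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    per_point = np.linalg.norm(est - gt, axis=1)
    # Assumption: inputs are in meters, so multiply by 1000 to report mm.
    return float(per_point.mean() * unit_scale)

# Toy example: one point is off by 5 mm, the other is exact.
estimated = np.zeros((2, 3))
ground_truth = np.array([[0.003, 0.004, 0.0], [0.0, 0.0, 0.0]])
error_mm = mean_point_error_mm(estimated, ground_truth)
```

Passing joints gives MPJPE and passing all mesh vertices gives V2V; the latter requires that both meshes share the same topology, so that row i refers to the same vertex in both arrays.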

Point-to-Surface (P2S): To compare PIXIE with methods that use a different mesh topology to SMPL(-X), e.g. MTC [101], we measure the mean distance from ground-truth vertices to the surface of the estimated mesh. P2S is stricter than MPJPE; it captures errors for 3D shape, but not for unnatural limb rotations (for the same joint positions).

4.3 Quantitative Evaluation

Whole-body. In Tab. 1 - 3 we report whole-body metrics (“All”), by taking into account the body, face and hands jointly. We add body-only (“Body”), hand-only (“L/R hand”), and face-only (“Face”) variants for completeness.

EHF [69]: Table 1 compares PIXIE to three baseline sets: (1) the optimization-based SMPLify-X [69] and MTC [101] that infer SMPL-X and Adam, (2) the regression-based SPIN [53] that infers SMPL, and (3) the regression-based ExPose [20] and FrankMocap [76] that infer SMPL-X.

Note that MTC and FrankMocap do not estimate the face. PIXIE outperforms optimization methods on most metrics, while being significantly faster. Moreover, it is on par with regression methods, both in terms of error metrics and runtime, which drops to sec for known body-part crops.

| Method | PA-V2V All | PA-V2V Body | PA-V2V L/R hand | PA-V2V Face | TR-V2V All | TR-V2V Body | TR-V2V L/R hand | TR-V2V Face |
|---|---|---|---|---|---|---|---|---|
| FrankMocap [76] | 58.6 | 56.3 | 11.5/12.2 | N/A | 85.6 | 88.9 | 41.1/41.4 | N/A |
| ExPose [20] | 59.3 | 55.3 | 11.2/11.8 | 2.1 | 87.4 | 89.8 | 45.3/43.9 | 22.2 |
| PIXIE (ours) | 56.3 | 52.2 | 12.1/12.1 | 2.2 | 85.1 | 87.6 | 32.8/31.8 | 24.2 |

Table 2: Evaluation on AGORA [11]. All errors are in mm. PIXIE estimates more accurate whole-body meshes than state-of-the-art methods, especially for the stricter TR-V2V metric.

| Method | PA-V2V All | PA-V2V Body | TR-V2V All | TR-V2V Body |
|---|---|---|---|---|
| Naive body | 59.7 | 54.3 | 70.5 | 83.4 |
| “Copy-paste” | 60.3 | 55.5 | 72.9 | 82.4 |
| PIXIE (ours) | 55.0 | 53.0 | 67.6 | 75.8 |

Table 3: Ablation for our moderator on EHF [69]. All errors are in mm. “Naive body” denotes a single regressor for the whole body, and “Copy-paste” denotes a naive integration of the independent expert estimations on the inferred body. PIXIE’s moderator clearly outperforms all baselines.

AGORA [11]: Table 2 compares PIXIE to ExPose [20] and FrankMocap [76] on the more challenging AGORA dataset. For the PA-V2V metric, PIXIE is on par with the baselines for the hands and face, but better for the whole body. For the stricter TR-V2V metric, PIXIE clearly outperforms both baselines, especially for hands. To identify the reasons for PIXIE’s better performance, we conduct ablation studies.

Ablation for moderators: Table 3 compares PIXIE to naive whole-body regression (no body-part experts) and the “copy-paste” fusion strategy [20, 76]. The naive version does not benefit from the expertise of the part experts. “Copy-paste” fusion can lead to erroneous hand/face orientation inference, since the respective experts lack global context. Moreover, estimating whole-body shape from a face image is not always reliable, e.g. when a person faces away from the camera (Fig. 2). PIXIE gracefully fuses “global” body and “local” part features with its moderators.

Ablation for gendered shape prior on 3DPW [98]: By removing our gendered-shape loss, the PA-V2V error increases from to mm. A similar qualitative ablation can be seen in Fig. 6; learned implicit reasoning about gender leads to more realistic 3D body shapes. Note that SMPL-X’s shared shape space for the whole body, lets all body parts – especially the face – contribute to body shape.

Parts-only. For completeness, we use standard benchmarks for body-only, face-only, and hand-only evaluation.

| Method | Body model | PA-MPJPE (mm) | TR-MPJPE (mm) | V2V (mm) |
|---|---|---|---|---|
| HMR [48] | SMPL | 81.3 | 130.0 | 65.2 |
| SPIN [53] | SMPL | 59.2 | 96.9 | 53.0 |
| FrankMocap [76] | SMPL-X | 61.9 | 96.7 | 55.1 |
| ExPose [20] | SMPL-X | 60.7 | 93.4 | 55.6 |
| PIXIE (ours) | SMPL-X | 61.3 | 91.0 | 50.9 |

Table 4: Evaluation on 3DPW [98]. PIXIE is the best for the stricter TR-MPJPE (joints) and V2V (surface) metrics.

Body-only on 3DPW [98]: Table 4 shows that PIXIE performs on par with SPIN [53], FrankMocap [76] and ExPose [20] for the PA-MPJPE metric, but outperforms them for the stricter TR-MPJPE (joints) and V2V (surface) metrics.

Face-only on NoW [81]: Table 5 shows that PIXIE outperforms not only the expressive whole-body method ExPose [20], but also strong and dedicated face-only methods, with the only exception being the recent work of Feng et al. [24].

Hand-only on FreiHAND [114]: Table 6 shows that our hand expert performs on par with the whole-body method ExPose [20], and the hand-specific “MANO CNN” [114], but clearly outperforms the hand expert of Zhou et al. [112].

| Method | PA-P2S Median (mm) | PA-P2S Mean (mm) | PA-P2S Std (mm) |
|---|---|---|---|
| 3DMM-CNN [95] | 1.84 | 2.33 | 2.05 |
| PRNet [25] | 1.50 | 1.98 | 1.88 |
| Deng et al. [22] | 1.23 | 1.54 | 1.29 |
| RingNet [81] | 1.21 | 1.54 | 1.31 |
| 3DDFA-V2 [38] | 1.23 | 1.57 | 1.39 |
| DECA [24] | 1.09 | 1.38 | 1.18 |
| ExPose [20] | 1.26 | 1.57 | 1.32 |
| PIXIE (ours) | 1.18 | 1.49 | 1.25 |

Table 5: Evaluation on NoW [81], using PA-P2S for the face/head. PIXIE is better than the whole-body ExPose. PIXIE outperforms many strong face-specific methods, and is on par with the best ones.

| Method | PA-MPJPE (mm) | PA-V2V (mm) | F@5mm | F@15mm |
|---|---|---|---|---|
| “MANO CNN” [114] | 11.0 | 10.9 | 0.516 | 0.934 |
| ExPose [20] hand expert | 12.2 | 11.8 | 0.484 | 0.918 |
| Zhou et al. [112] | 15.7 | - | - | - |
| PIXIE hand expert | 12.0 | 12.1 | 0.468 | 0.919 |

Table 6: Evaluation on FreiHAND [114]. PIXIE’s hand expert is on par with the hand-specific “MANO CNN” and ExPose, but clearly outperforms the more related Zhou et al. that also uses hand-body feature fusion.

4.4 Qualitative Evaluation

Figure 4: Qualitative comparison: From left to right: (1) RGB Image, (2) ExPose [20], (3) FrankMocap [76], (4) PIXIE with predicted albedo and lighting.

We qualitatively evaluate our approach on a wealth of images from the web. Figure 4 compares PIXIE against FrankMocap [76] and ExPose [20], which also regress SMPL-X. Both baselines fail when the hand expert faces ambiguities; PIXIE gains robustness by using the full-body context. Both baselines give body shapes that look average or have the wrong gender; PIXIE gives the most realistic shapes due to its “gendered” shape loss. FrankMocap fails for strong occlusions. Last, ExPose struggles with accurate facial expressions and FrankMocap with head rotations; PIXIE outperforms both with its strong face/head expert and predicts a more realistic face.

Figure 5: Comparison with Zhou et al. [112]. From left to right: (1) RGB image, (2) Zhou et al., (3) PIXIE with inferred facial details and (4) inferred albedo and lighting. Note that Zhou et al. use tight face crops through Dlib [50] to improve performance; PIXIE needs no tight face crops.

Figure 5 compares PIXIE with Zhou et al. [112], a recent work that also estimates a textured face. PIXIE gives more accurate poses (see how the hands and face align to the image), because it fuses both face-body and hand-body expert features, uniquely weighted by their confidence. PIXIE also gives more realistic body shapes, not only due to its gendered shape loss, but also thanks to the shared body, hand and face shape space of SMPL-X. This allows PIXIE’s face expert to uniquely contribute to whole-body shape.

To verify this, we apply our face expert on face-only images and get the whole-body shapes of Fig. 7. These are not only correctly “gendered”, but also have a plausible BMI. PIXIE is the only 3D whole-body estimation method that explores such face-body shape correlations explicitly.
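
The confidence-weighted feature fusion mentioned above can be illustrated with a minimal NumPy sketch. It is not PIXIE's implementation: the two-layer MLP, its size, the random weights and the name `Moderator` are assumptions made for illustration only. The sketch shows the general idea of predicting one confidence weight per expert from the concatenated features and fusing the features as a convex combination.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


class Moderator:
    """Toy moderator: predicts two confidence weights and fuses two features.

    Hypothetical architecture (a small MLP with random, untrained weights).
    """

    def __init__(self, dim, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(2 * dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(scale=0.1, size=(hidden, 2))
        self.b2 = np.zeros(2)

    def __call__(self, f_body, f_part):
        # Look at both features jointly to judge which expert to trust.
        x = np.concatenate([f_body, f_part])
        h = np.maximum(x @ self.W1 + self.b1, 0.0)  # ReLU
        w = softmax(h @ self.W2 + self.b2)          # weights sum to 1
        fused = w[0] * f_body + w[1] * f_part       # convex combination
        return fused, w


# Usage: fuse a body-derived feature with a part-expert feature.
moderator = Moderator(dim=16)
rng = np.random.default_rng(1)
fused, weights = moderator(rng.normal(size=16), rng.normal(size=16))
```

In a trained system the weights would be learned end to end, so that an unreliable part crop (e.g., an occluded hand) receives a low weight and the body context dominates.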

Small mesh-to-image misalignment is a common limitation of regression methods, which typically pool “global” features from the entire image, losing local information. This could be tackled with “pixel-aligned” features [36, 79, 107]. Moreover, SMPL-X models bodies without clothing; adding clothing models [21, 63] is a challenging but promising avenue. Future work could further improve cases with self-contact [12, 26], or other extreme ambiguities. All of these are hard problems, beyond the scope of this paper.

Figure 6: Ablation for the gendered shape prior and the shared shape space between the body and head. From left to right: (1) RGB Image, (2) No shared shape space, (3, 4) PIXIE without (3) / with (4) the gendered shape prior.
Figure 7: Whole-body shape estimation from only our face expert, using SMPL-X’s joint shape space for all body parts.

For many more qualitative results of PIXIE, as well as an extended version of Fig. 4, please see our Sup. Mat.

5 Conclusion

We present PIXIE, a novel expressive whole-body reconstruction method that recovers an animatable 3D avatar with a detailed face from a single RGB image. PIXIE uses body-driven attention to leverage dedicated body, face and hand experts. It learns a novel moderator that reasons about the confidence of each expert, to fuse their features according to this confidence and exploit their complementary strengths. It uses the best practices from the face community for accurate faces with realistic albedo and geometric details. Uniquely, the face expert contributes to more realistic whole-body shapes, by using a shared face-body shape space. To further improve shape, PIXIE uses implicit reasoning about gender, to encourage likely “gendered” body shapes. Qualitative results show natural and expressive humans, with improved body shape, well-articulated hands, and realistic faces, comparable to the best face-only methods. We believe that PIXIE will be useful for many applications that need expressive human understanding from images.

Disclosure: MJB has received research gift funds from Adobe, Intel, Nvidia, Facebook, and Amazon. While MJB is a part-time employee of Amazon, his research was performed solely at, and funded solely by, MPI. MJB has financial interests in Amazon, Datagen Technologies, and Meshcapade GmbH.
Acknowledgments: We thank Victoria Fernández Abrevaya, Yinghao Huang, Yuliang Xiu and Radek Danecek for discussions. This work was partially supported by the Max Planck ETH Center for Learning Systems.


References

  • [1] 3DPeople.
  • [2] AXYZ.
  • [3] HDRI Haven.
  • [4] HumanAlloy.
  • [5] RenderPeople.
  • [6] Unreal Engine.
  • [7] Unreal Engine Marketplace.
  • [8] Ankur Agarwal and Bill Triggs. Recovering 3D human pose from monocular images. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 28(1):44–58, 2006.
  • [9] Ijaz Akhter and Michael J. Black. Pose-conditioned joint angle limits for 3D human pose reconstruction. In Computer Vision and Pattern Recognition (CVPR), pages 1446–1455, 2015.
  • [10] Oswald Aldrian and William AP Smith. Inverse rendering of faces with a 3D morphable model. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(5):1080–1093, 2013.
  • [11] Anonymous. AGORA: Avatars in Geography Optimized for Regression Analysis. In preparation, 2021.
  • [12] Anonymous. On self contact and human pose. In preparation, 2021.
  • [13] Seungryul Baek, Kwang In Kim, and Tae-Kyun Kim. Pushing the envelope for RGB-based dense 3D hand pose estimation via neural rendering. In Computer Vision and Pattern Recognition (CVPR), pages 1067–1076, 2019.
  • [14] Anil Bas, William A. P. Smith, Timo Bolkart, and Stefanie Wuhrer. Fitting a 3D morphable model to edges: A comparison between hard and soft correspondences. In Asian Conference on Computer Vision Workshops (ACCVw), pages 377–391, 2017.
  • [15] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In European Conference on Computer Vision (ECCV), pages 561–578, 2016.
  • [16] Adnane Boukhayma, Rodrigo de Bem, and Philip H.S. Torr. 3D hand shape and pose from images in the wild. In Computer Vision and Pattern Recognition (CVPR), pages 10843–10852, 2019.
  • [17] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks). In International Conference on Computer Vision (ICCV), pages 1021–1030, 2017.
  • [18] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In International Conference on Automatic Face & Gesture Recognition (FG), pages 67–74, 2018.
  • [19] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43(1):172–186, 2021.
  • [20] Vasileios Choutas, Georgios Pavlakos, Timo Bolkart, Dimitrios Tzionas, and Michael J. Black. Monocular expressive body regression through body-driven attention. In European Conference on Computer Vision (ECCV), pages 20–40, 2020.
  • [21] Enric Corona, Albert Pumarola, Guillem Alenyà, Gerard Pons-Moll, and Francesc Moreno-Noguer. SMPLicit: Topology-aware generative model for clothed people. In Computer Vision and Pattern Recognition (CVPR), 2021.
  • [22] Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3D face reconstruction with weakly-supervised learning: From single image to image set. In Computer Vision and Pattern Recognition Workshops (CVPRw), pages 285–295, 2019.
  • [23] Bernhard Egger, William A. P. Smith, Ayush Tewari, Stefanie Wuhrer, Michael Zollhoefer, Thabo Beeler, Florian Bernard, Timo Bolkart, Adam Kortylewski, Sami Romdhani, Christian Theobalt, Volker Blanz, and Thomas Vetter. 3D morphable face models - past, present and future. Transactions on Graphics (TOG), 39(5):157:1–157:38, 2020.
  • [24] Yao Feng, Haiwen Feng, Michael J. Black, and Timo Bolkart. Learning an animatable detailed 3D face model from in-the-wild images. Transactions on Graphics (TOG), 40(4):88:1–88:13, 2021.
  • [25] Yao Feng, Fan Wu, Xiaohu Shao, Yanfeng Wang, and Xi Zhou. Joint 3D face reconstruction and dense alignment with position map regression network. In European Conference on Computer Vision (ECCV), pages 557–574, 2018.
  • [26] Mihai Fieraru, Mihai Zanfir, Elisabeta Oneata, Alin-Ionut Popa, Vlad Olaru, and Cristian Sminchisescu. Learning complex 3D human Self-Contact. In AAAI Conference on Artificial Intelligence, 2021.
  • [27] Mihai Fieraru, Mihai Zanfir, Elisabeta Oneata, Alin-Ionut Popa, Vlad Olaru, and Cristian Sminchisescu. Three-dimensional reconstruction of human interactions. In Computer Vision and Pattern Recognition (CVPR), pages 7212–7221, 2020.
  • [28] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM (CACM), 24(6):381–395, 1981.
  • [29] Valentin Gabeur, Jean-Sebastien Franco, Xavier Martin, Cordelia Schmid, and Gregory Rogez. Moulding Humans: Non-parametric 3D human shape estimation from single images. In International Conference on Computer Vision (ICCV), pages 2232–2241, 2019.
  • [30] Alessio Gallucci, Dmitry Znamenskiy, and Milan Petkovic. Prediction of 3D body parts from face shape and anthropometric measurements. Journal of Image and Graphics, 8(3), 2020.
  • [31] Liuhao Ge, Zhou Ren, Yuncheng Li, Zehao Xue, Yingying Wang, Jianfei Cai, and Junsong Yuan. 3D hand shape and pose estimation from a single RGB image. In Computer Vision and Pattern Recognition (CVPR), pages 10833–10842, 2019.
  • [32] Baris Gecer, Stylianos Ploumpis, Irene Kotsia, and Stefanos Zafeiriou. GANFIT: Generative adversarial network fitting for high fidelity 3D face reconstruction. In Computer Vision and Pattern Recognition (CVPR), pages 1155–1164, 2019.
  • [33] Kyle Genova, Forrester Cole, Aaron Maschinot, Aaron Sarna, Daniel Vlasic, and William T. Freeman. Unsupervised training for 3D morphable model regression. In Computer Vision and Pattern Recognition (CVPR), pages 8377–8386, 2018.
  • [34] Kristen Grauman, Gregory Shakhnarovich, and Trevor Darrell. Inferring 3D structure with a statistical image-based shape model. In International Conference on Computer Vision (ICCV), pages 641–648, 2003.
  • [35] Peng Guan, Alexander Weiss, Alexandru Balan, and Michael J. Black. Estimating human shape and pose from a single image. In International Conference on Computer Vision (ICCV), pages 1381–1388, 2009.
  • [36] Riza Alp Güler and Iasonas Kokkinos. HoloPose: Holistic 3D human reconstruction In-The-Wild. In Computer Vision and Pattern Recognition (CVPR), pages 10876–10886, 2019.
  • [37] Semih Gunel, Helge Rhodin, and Pascal Fua. What face and body shapes can tell us about height. In International Conference on Computer Vision Workshops (ICCVw), pages 1819–1827, 2019.
  • [38] Jianzhu Guo, Xiangyu Zhu, Yang Yang, Fan Yang, Zhen Lei, and Stan Z Li. Towards fast, accurate and stable 3D dense face alignment. In European Conference on Computer Vision (ECCV), volume 12364, pages 152–168, 2020.
  • [39] Shreyas Hampali, Mahdi Rad, Markus Oberweger, and Vincent Lepetit. HOnnotate: A method for 3D annotation of hand and object poses. In Computer Vision and Pattern Recognition (CVPR), pages 3193–3203, 2020.
  • [40] Yana Hasson, Gül Varol, Dimitrios Tzionas, Igor Kalevatykh, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning joint reconstruction of hands and manipulated objects. In Computer Vision and Pattern Recognition (CVPR), pages 11807–11816, 2019.
  • [41] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [42] Yinghao Huang, Federica Bogo, Christoph Lassner, Angjoo Kanazawa, Peter V. Gehler, Javier Romero, Ijaz Akhter, and Michael J. Black. Towards accurate marker-less human shape and pose estimation over time. In International Conference on 3D Vision (3DV), pages 421–430, 2017.
  • [43] Umar Iqbal, Pavlo Molchanov, Thomas Breuel, Juergen Gall, and Jan Kautz. Hand pose estimation via latent 2.5D heatmap regression. In European Conference on Computer Vision (ECCV), pages 125–143, 2018.
  • [44] Aaron S Jackson, Adrian Bulat, Vasileios Argyriou, and Georgios Tzimiropoulos. Large pose 3D face reconstruction from a single image via direct volumetric CNN regression. In International Conference on Computer Vision (ICCV), pages 1031–1039, 2017.
  • [45] Sheng Jin, Lumin Xu, Jin Xu, Can Wang, Wentao Liu, Chen Qian, Wanli Ouyang, and Ping Luo. Whole-body human pose estimation in the wild. In European Conference on Computer Vision (ECCV), pages 196–214, 2020.
  • [46] Sam Johnson and Mark Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In British Machine Vision Conference (BMVC), pages 12.1–12.11, 2010.
  • [47] Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total capture: A 3D deformation model for tracking faces, hands, and bodies. In Computer Vision and Pattern Recognition (CVPR), pages 8320–8329, 2018.
  • [48] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Computer Vision and Pattern Recognition (CVPR), pages 7122–7131, 2018.
  • [49] Angjoo Kanazawa, Jason Y. Zhang, Panna Felsen, and Jitendra Malik. Learning 3D human dynamics from video. In Computer Vision and Pattern Recognition (CVPR), pages 5614–5623, 2019.
  • [50] Davis E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research (JMLR), 10:1755–1758, 2009.
  • [51] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [52] Enes Kocabey, Mustafa Camurcu, Ferda Ofli, Yusuf Aytar, Javier Marín, Antonio Torralba, and Ingmar Weber. Face-to-bmi: Using computer vision to infer body mass index on social media. In International Conference on Web and Social Media (ICWSM), pages 572–575, 2017.
  • [53] Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In International Conference on Computer Vision (ICCV), pages 2252–2261, 2019.
  • [54] Nikos Kolotouros, Georgios Pavlakos, and Kostas Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. In Computer Vision and Pattern Recognition (CVPR), pages 4501–4510, 2019.
  • [55] Dominik Kulon, Riza Alp Güler, Iasonas Kokkinos, Michael M. Bronstein, and Stefanos Zafeiriou. Weakly-supervised mesh-convolutional hand reconstruction in the wild. In Computer Vision and Pattern Recognition (CVPR), pages 4989–4999, 2020.
  • [56] Dominik Kulon, Haoyang Wang, Riza Alp Güler, Michael M. Bronstein, and Stefanos Zafeiriou. Single image 3D hand reconstruction with mesh convolutions. In British Machine Vision Conference (BMVC), 2019.
  • [57] Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J. Black, and Peter V. Gehler. Unite the People: Closing the Loop Between 3D and 2D Human Representations. In Computer Vision and Pattern Recognition (CVPR), pages 4704–4713, 2017.
  • [58] Hsi-Jian Lee and Zen Chen. Determination of 3D human body postures from a single view. Computer Vision, Graphics, and Image Processing, 30(2):148–168, 1985.
  • [59] Sijin Li, Weichen Zhang, and Antoni B Chan. Maximum-margin structured learning with deep networks for 3D human pose estimation. In International Conference on Computer Vision (ICCV), pages 2848–2856, 2015.
  • [60] Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. Transactions on Graphics (TOG), 36(6):194:1–194:17, 2017.
  • [61] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A Skinned Multi-Person Linear Model. Transactions on Graphics (TOG), 34(6):248:1–248:16, 2015.
  • [62] Matthew M. Loper and Michael J. Black. OpenDR: An approximate differentiable renderer. In European Conference on Computer Vision (ECCV), pages 154–169, 2014.
  • [63] Qianli Ma, Jinlong Yang, Anurag Ranjan, Sergi Pujades, Gerard Pons-Moll, Siyu Tang, and Michael J. Black. Learning to dress 3D people in generative clothing. In Computer Vision and Pattern Recognition (CVPR), pages 6468–6477, 2020.
  • [64] Julieta Martinez, Rayat Hossain, Javier Romero, and James J. Little. A simple yet effective baseline for 3D human pose estimation. In International Conference on Computer Vision (ICCV), pages 2659–2668, 2017.
  • [65] Franziska Mueller, Florian Bernard, Oleksandr Sotnychenko, Dushyant Mehta, Srinath Sridhar, Dan Casas, and Christian Theobalt. GANerated hands for real-time 3D hand tracking from monocular RGB. In Computer Vision and Pattern Recognition (CVPR), pages 49–59, 2018.
  • [66] Yuval Nirkin, Iacopo Masi, Anh Tran Tuan, Tal Hassner, and Gerard Medioni. On face segmentation, face swapping, and face perception. In International Conference on Automatic Face & Gesture Recognition (FG), pages 98–105, 2018.
  • [67] Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Peter V. Gehler, and Bernt Schiele. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In International Conference on 3D Vision (3DV), pages 484–494, 2018.
  • [68] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Conference on Neural Information Processing Systems (NeurIPS), pages 8024–8035, 2019.
  • [69] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019.
  • [70] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G Derpanis, and Kostas Daniilidis. Coarse-to-fine volumetric prediction for single-image 3D human pose. In Computer Vision and Pattern Recognition (CVPR), pages 1263–1272, 2017.
  • [71] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3D human pose and shape from a single color image. In Computer Vision and Pattern Recognition (CVPR), pages 459–468, 2018.
  • [72] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3D deep learning with PyTorch3D. arXiv:2007.08501, 2020.
  • [73] Kathleen M. Robinette, Sherri Blackwell, Hein Daanen, Mark Boehmer, Scott Fleming, Tina Brill, David Hoeferlin, and Dennis Burnsides. Civilian American and European Surface Anthropometry Resource (CAESAR) final report. Technical Report AFRL-HE-WP-TR-2002-0169, US Air Force Research Laboratory, 2002.
  • [74] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. Transactions on Graphics (TOG), 36(6):245:1–245:17, 2017.
  • [75] Yu Rong, Ziwei Liu, Cheng Li, Kaidi Cao, and Chen Change Loy. Delving deep into hybrid annotations for 3D human recovery in the wild. In International Conference on Computer Vision (ICCV), pages 5339–5347, 2019.
  • [76] Yu Rong, Takaaki Shiratori, and Hanbyul Joo. FrankMocap: Fast monocular 3D hand and body motion capture by regression and integration. arXiv:2008.08324, 2020.
  • [77] Rasmus Rothe, Radu Timofte, and Luc Van Gool. DEX: Deep expectation of apparent age from a single image. In International Conference on Computer Vision Workshops (ICCVw), pages 252–257, 2015.
  • [78] Nadine Rueegg, Christoph Lassner, Michael J. Black, and Konrad Schindler. Chained representation cycling: Learning to estimate 3D human pose and shape by cycling between representations. In AAAI Conference on Artificial Intelligence, pages 5561–5569, 2020.
  • [79] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In International Conference on Computer Vision (ICCV), pages 2304–2314, 2019.
  • [80] Shunsuke Saito, Tomas Simon, Jason Saragih, and Hanbyul Joo. PIFuHD: Multi-level pixel-aligned implicit function for high-resolution 3D human digitization. In Computer Vision and Pattern Recognition (CVPR), pages 81–90, 2020.
  • [81] Soubhik Sanyal, Timo Bolkart, Haiwen Feng, and Michael J. Black. Learning to regress 3D face shape and expression from an image without 3D supervision. In Computer Vision and Pattern Recognition (CVPR), pages 7763–7772, 2019.
  • [82] Leonid Sigal and Michael J Black. Predicting 3D people from 2D pictures. In International Conference on Articulated Motion and Deformable Objects (AMDO), pages 185–195, 2006.
  • [83] Tomas Simon, Hanbyul Joo, Iain Matthews, and Yaser Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In Computer Vision and Pattern Recognition (CVPR), pages 4645–4653, 2017.
  • [84] David Smith, Matthew Loper, Xiaochen Hu, Paris Mavroidis, and Javier Romero. FACSIMILE: Fast and accurate scans from an image in less than a second. In International Conference on Computer Vision (ICCV), pages 5329–5338, 2019.
  • [85] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Computer Vision and Pattern Recognition (CVPR), pages 5693–5703, 2019.
  • [86] Xiao Sun, Jiaxiang Shang, Shuang Liang, and Yichen Wei. Compositional human pose regression. In International Conference on Computer Vision (ICCV), pages 2621–2630, 2017.
  • [87] Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In European Conference on Computer Vision (ECCV), pages 536–553, 2018.
  • [88] Bugra Tekin, Federica Bogo, and Marc Pollefeys. H+O: Unified egocentric recognition of 3D hand-object poses and interactions. In Computer Vision and Pattern Recognition (CVPR), pages 4511–4520, 2019.
  • [89] Bugra Tekin, Isinsu Katircioglu, Mathieu Salzmann, Vincent Lepetit, and Pascal Fua. Structured prediction of 3D human pose with deep neural networks. In British Machine Vision Conference (BMVC), 2016.
  • [90] Ayush Tewari, Michael Zollhöfer, Pablo Garrido, Florian Bernard, Hyeongwoo Kim, Patrick Pérez, and Christian Theobalt. Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz. In Computer Vision and Pattern Recognition (CVPR), pages 2549–2559, 2018.
  • [91] Ayush Tewari, Michael Zollhöfer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Perez, and Christian Theobalt. MoFA: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In International Conference on Computer Vision (ICCV), pages 3735–3744, 2017.
  • [92] Justus Thies, Michael Zollhöfer, Marc Stamminger, Christian Theobalt, and Matthias Nießner. Face2Face: Real-time face capture and reenactment of RGB videos. In Computer Vision and Pattern Recognition (CVPR), pages 2387–2395, 2016.
  • [93] Denis Tome, Chris Russell, and Lourdes Agapito. Lifting from the deep: Convolutional 3D pose estimation from a single image. In Computer Vision and Pattern Recognition (CVPR), pages 5689–5698, 2017.
  • [94] Luan Tran, Feng Liu, and Xiaoming Liu. Towards high-fidelity nonlinear 3D face morphable model. In Computer Vision and Pattern Recognition (CVPR), pages 1126–1135, 2019.
  • [95] Anh Tuan Tran, Tal Hassner, Iacopo Masi, and Gerard Medioni. Regressing Robust and Discriminative 3D Morphable Models with a Very Deep Neural Network. In Computer Vision and Pattern Recognition (CVPR), pages 1493–1502, 2017.
  • [96] Gul Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. BodyNet: Volumetric inference of 3D human body shapes. In European Conference on Computer Vision (ECCV), pages 20–38, 2018.
  • [97] Thomas Vetter and Volker Blanz. Estimating coloured 3D face models from single images: An example based approach. In European Conference on Computer Vision (ECCV), pages 499–513, 1998.
  • [98] Timo von Marcard, Roberto Henschel, Michael J. Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In European Conference on Computer Vision (ECCV), pages 614–631, 2018.
  • [99] Yangang Wang, Cong Peng, and Yebin Liu. Mask-pose cascaded CNN for 2D hand pose estimation from single color images. Transactions on Circuits and Systems for Video Technology (TCSVT), 29(11):3258–3268, 2019.
  • [100] Philippe Weinzaepfel, Romain Brégier, Hadrien Combaluzier, Vincent Leroy, and Grégory Rogez. DOPE: Distillation of part experts for whole-body 3D pose estimation in the wild. In European Conference on Computer Vision (ECCV), pages 380–397, 2020.
  • [101] Donglai Xiang, Hanbyul Joo, and Yaser Sheikh. Monocular total capture: Posing face, body, and hands in the wild. In Computer Vision and Pattern Recognition (CVPR), pages 10965–10974, 2019.
  • [102] Hongyi Xu, Eduard Gabriel Bazavan, Andrei Zanfir, William T. Freeman, Rahul Sukthankar, and Cristian Sminchisescu. GHUM & GHUML: Generative 3D human shape and articulated pose models. In Computer Vision and Pattern Recognition (CVPR), pages 6183–6192, 2020.
  • [103] Haotian Yang, Hao Zhu, Yanru Wang, Mingkai Huang, Qiu Shen, Ruigang Yang, and Xun Cao. FaceScape: A large-scale high quality 3D face dataset and detailed riggable 3D face prediction. In Computer Vision and Pattern Recognition (CVPR), pages 598–607, 2020.
  • [104] Andrei Zanfir, Elisabeta Marinoiu, and Cristian Sminchisescu. Monocular 3D pose and shape estimation of multiple people in natural scenes - the importance of multiple scene constraints. In Computer Vision and Pattern Recognition (CVPR), pages 2148–2157, 2018.
  • [105] Wang Zeng, Wanli Ouyang, Ping Luo, Wentao Liu, and Xiaogang Wang. 3D human mesh regression with dense correspondence. In Computer Vision and Pattern Recognition (CVPR), pages 7052–7061, 2020.
  • [106] Chao Zhang, Sergi Pujades, Michael Black, and Gerard Pons-Moll. Detailed, accurate, human shape estimation from clothed 3D scan sequences. In Computer Vision and Pattern Recognition (CVPR), pages 5484–5493, 2017.
  • [107] Hongwen Zhang, Jie Cao, Guo Lu, Wanli Ouyang, and Zhenan Sun. Learning 3D human shape and pose from dense body parts. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.
  • [108] Xiong Zhang, Qiang Li, Hong Mo, Wenbo Zhang, and Wen Zheng. End-to-end hand mesh recovery from a monocular RGB image. In International Conference on Computer Vision (ICCV), pages 2354–2364, 2019.
  • [109] Long Zhao, Xi Peng, Yu Tian, Mubbasir Kapadia, and Dimitris N. Metaxas. Semantic graph convolutional networks for 3D human pose regression. In Computer Vision and Pattern Recognition (CVPR), pages 3425–3435, 2019.
  • [110] Zerong Zheng, Tao Yu, Yixuan Wei, Qionghai Dai, and Yebin Liu. DeepHuman: 3D human reconstruction from a single image. In International Conference on Computer Vision (ICCV), pages 7738–7748, 2019.
  • [111] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Computer Vision and Pattern Recognition (CVPR), pages 5745–5753, 2019.
  • [112] Yuxiao Zhou, Marc Habermann, Ikhsanul Habibie, Ayush Tewari, Christian Theobalt, and Feng Xu. Monocular Real-time Full Body Capture with Inter-part Correlations. In Computer Vision and Pattern Recognition (CVPR), 2021.
  • [113] Christian Zimmermann and Thomas Brox. Learning to estimate 3D hand pose from single RGB images. In International Conference on Computer Vision (ICCV), pages 4913–4921, 2017.
  • [114] Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russell, Max Argus, and Thomas Brox. FreiHAND: A dataset for markerless capture of hand pose and shape from single RGB images. In International Conference on Computer Vision (ICCV), pages 813–822, 2019.

Appendix A Implementation Details

Data augmentation: For training data, we use image crops around the body, face and hands. We augment our training image crops, following mainly [20], as described below. First, we use standard techniques, namely random horizontal flipping, random image rotations, color noise addition and random translation of the crop’s center. However, this is not enough, as there is a significant domain gap between face-only and hand-only datasets on the one hand, and the respective image crops extracted from full-body images on the other; the former have significantly higher resolution. To account for this, we also randomly down-sample and up-sample the head and hand image crops, to simulate various lower resolutions. Finally, inspired by [76], we add synthetic motion blur to face and hand crops, to simulate the motion blur that is common in full-body images. Exact augmentation parameters can be found on our code website.
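
The resolution and motion-blur augmentations described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the released augmentation code: the down-sampling factors, blur lengths, the horizontal-only blur direction and all function names are assumptions chosen for brevity.

```python
import numpy as np


def simulate_low_resolution(img, factor):
    """Down-sample by block averaging, then up-sample by pixel repetition.

    img: float array (H, W, C). H and W are cropped to multiples of `factor`.
    """
    h, w, c = img.shape
    h2, w2 = h - h % factor, w - w % factor
    blocks = img[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor, c)
    small = blocks.mean(axis=(1, 3))
    return small.repeat(factor, axis=0).repeat(factor, axis=1)


def motion_blur(img, length):
    """Horizontal motion blur: average `length` neighbouring pixels per row."""
    pad_l = length // 2
    pad_r = length - 1 - pad_l
    padded = np.pad(img, ((0, 0), (pad_l, pad_r), (0, 0)), mode="edge")
    w = img.shape[1]
    return sum(padded[:, k:k + w] for k in range(length)) / length


def augment_part_crop(img, rng):
    """Apply a random resolution drop and a random blur to a face/hand crop."""
    factor = int(rng.choice([1, 2, 4]))      # hypothetical factors
    length = int(rng.choice([1, 3, 5, 7]))   # hypothetical blur lengths
    return motion_blur(simulate_low_resolution(img, factor), length)


# Usage on a random stand-in for a 64x64 RGB crop with values in [0, 1].
rng = np.random.default_rng(0)
crop = rng.random((64, 64, 3))
augmented = augment_part_crop(crop, rng)
```

A factor or blur length of 1 leaves the crop unchanged, so the sampled augmentations range from "no degradation" to a strongly degraded crop that resembles a small, blurry part region in a full-body image.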

Training details: We use PyTorch [68] to implement our pipeline. We follow a three-step training procedure: (1) We pre-train the model with body-only, face-only and hand-only datasets; for each dataset we train only the respective parameters. Since these datasets are captured independently, there is no body image that corresponds to a face-only or hand-only image. Consequently, for this step we cannot apply feature fusion, and body-part features go directly to the respective regressor(s) (bypassing the moderators), to estimate the respective body-part parameters. Similar to existing work, we train only a right-hand regressor; for images of a left hand, we flip the image horizontally to use the right-hand regressor, and mirror the predictions to get a left hand. (2) Then, using the same data, we freeze the feature encoders and proceed with training the regressors and extractors (see the linear layers between the body encoder and the face and hand moderators in the architecture figure of the paper). This step encourages the face and hand features extracted from body images to be in the same space as the features from part-only images, so that the face and hand regressors work for both feature types. (3) Finally, we train the full network, including the face and hand moderators, but this time using training images with full SMPL-X ground truth, to extract part crops from full-body images as well. However, there are two problems. First, for these images there is no skin mask available; consequently, we remove the loss for body shape and do not apply a photometric and identity loss on head crops. Second, localizing the hands with body-driven attention is much harder compared to the head, due to the longer kinematic chain; consequently, we freeze the hand regressor to avoid fine-tuning it with invalid inputs.
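
The left-hand trick from step (1) can be made concrete with a short NumPy sketch. It assumes that hand pose is predicted as per-joint axis-angle vectors and that mirroring happens across the plane x = 0; under these assumptions, mirroring a rotation keeps the x component of its axis-angle vector and negates the y and z components. The function names and the `right_hand_regressor` callable are hypothetical placeholders, not part of the PIXIE code.

```python
import numpy as np


def rodrigues(a):
    """Axis-angle vector (3,) to rotation matrix (3, 3)."""
    theta = np.linalg.norm(a)
    if theta < 1e-12:
        return np.eye(3)
    k = a / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def flip_image(img):
    """Flip an (H, W, C) image horizontally."""
    return img[:, ::-1]


def mirror_axis_angle(pose):
    """Mirror per-joint axis-angle rotations (J, 3) across the plane x = 0."""
    mirrored = pose.copy()
    mirrored[:, 1:] *= -1.0
    return mirrored


def predict_left_hand(img, right_hand_regressor):
    """Run a right-hand regressor on the flipped image, then mirror the pose."""
    pose_right = right_hand_regressor(flip_image(img))
    return mirror_axis_angle(pose_right)


# Sanity check: mirroring agrees with conjugating the rotation by S = diag(-1, 1, 1).
S = np.diag([-1.0, 1.0, 1.0])
a = np.array([[0.3, -0.5, 0.8]])
assert np.allclose(rodrigues(mirror_axis_angle(a)[0]), S @ rodrigues(a[0]) @ S)
```

Mirroring twice returns the original pose, so the same helper converts in both directions between the left-hand and right-hand conventions.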

All parameters are optimized using Adam [51] with a learning rate of . For training the body, hand and face sub-networks, we use a batch size of , , and , respectively. All input images are resized to pixels before feeding them to our network.

Global to relative pose: The face and hand regressors estimate the absolute head and wrist orientations, i.e., irrespective of the pose of the (parent) main body. However, to “apply” these estimates to a SMPL-X body that is already posed by the body expert (up to the wrist and neck, excluding them), we need to express them relative to their parent in the kinematic skeleton. To do so, the absolute orientation is pre-multiplied by the inverse of the parent’s accumulated rotation, which is given by the chain transformation function according to SMPL-X’s kinematic skeleton hierarchy.
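
The global-to-relative conversion described above can be sketched with rotation matrices. The example below is a generic illustration under stated assumptions: it uses a hypothetical four-joint toy chain rather than the actual SMPL-X hierarchy, and the names `chain_rotation` and `global_to_relative` are illustrative. It relies only on the standard composition rule that a joint's global rotation equals its parent's global rotation times its own local rotation.

```python
import numpy as np


def chain_rotation(local_rots, parents, joint):
    """Accumulate local rotations from the root down to `joint` (inclusive).

    Returns the identity when `joint` is -1 (i.e., "above" the root).
    """
    chain = []
    j = joint
    while j != -1:
        chain.append(j)
        j = parents[j]
    R = np.eye(3)
    for j in reversed(chain):
        R = R @ local_rots[j]
    return R


def global_to_relative(R_abs, local_rots, parents, joint):
    """Express an absolute orientation relative to the joint's parent."""
    R_parent = chain_rotation(local_rots, parents, parents[joint])
    # The inverse of a rotation matrix is its transpose.
    return R_parent.T @ R_abs


def rot_z(angle):
    """Rotation about the z axis, used to build a toy example."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Toy chain: 0 (root) -> 1 -> 2 -> 3 (stand-in for a wrist).
parents = [-1, 0, 1, 2]
local_rots = [rot_z(0.2), rot_z(-0.5), rot_z(0.9), np.eye(3)]
R_abs_wrist = rot_z(1.3)  # absolute orientation from a part expert

local_rots[3] = global_to_relative(R_abs_wrist, local_rots, parents, 3)
# After the update, the accumulated wrist rotation matches the absolute estimate.
assert np.allclose(chain_rotation(local_rots, parents, 3), R_abs_wrist)
```

The same conversion would apply to the head, with the neck's accumulated rotation playing the role of the parent.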

Appendix B Evaluation

B.1 Qualitative Evaluation

Comparison with MTC: In Fig. A.1 we compare PIXIE with MTC [101]. PIXIE is two orders of magnitude faster and predicts more accurate 3D body shapes. However, when 2D joint estimations are accurate, optimization-based methods, such as MTC [101] and SMPLify-X [69], tend to estimate bodies that are better aligned with the image.

Expressive body reconstruction: We compare our method, PIXIE, with other state-of-the-art expressive body reconstruction methods in Fig. A.2. PIXIE avoids the mistakes made by naive “copy-and-paste” fusion thanks to its moderator mechanism. It utilizes a gender shape prior during training and learns to implicitly reason about the gender of each person, predicting more visually plausible shapes within a gender-neutral SMPL-X shape space. 3D head pose, shape and expression directly benefit from face-specific losses, widely used in the 3D face community, producing results that are more accurate than the baselines [20, 76].

Qualitative results: Finally, in Figs. A.3, A.4 and A.5 we provide more standalone PIXIE results. Overall, PIXIE produces visually plausible body shapes with detailed facial expressions.

Failure cases: Although the gender prior loss and the shared whole-body shape space result in better 3D shape predictions, they are not sufficient for perfectly estimating full-body 3D shape. Furthermore, the employed photometric term often causes the model to prefer to explain image evidence using lighting, rather than albedo, which leads to incorrect skin tone predictions. These points highlight important directions for improving PIXIE. Representative failure cases can be seen in Fig. A.6.

Figure A.1: Qualitative PIXIE results and comparison to MTC [101]. From left to right: (1) RGB image, (2) MTC [101], (3) PIXIE, (4) PIXIE with facial geometric details, (5) PIXIE with estimated face albedo and lighting. Overall, PIXIE produces more visually plausible body shapes and more detailed facial expressions.
Figure A.2: Qualitative PIXIE results and comparison to ExPose [20] and FrankMocap [76]. From left to right: (1) RGB image, (2) ExPose [20], (3) FrankMocap [76], (4) PIXIE 3D expressive body predictions, (5) PIXIE with estimated albedo and lighting. Our model improves upon existing expressive body estimation methods, by estimating (i) more visually plausible body shapes, through implicit reasoning about gender, and (ii) better facial expressions and details, by leveraging insights from the 3D face community.
Figure A.3: Qualitative PIXIE results. From left to right: (1) RGB image, (2) PIXIE, (3) PIXIE with facial geometric details, (4) PIXIE with estimated face albedo and lighting.
Figure A.4: Qualitative PIXIE results. From left to right: (1) RGB image, (2) PIXIE, (3) PIXIE with facial geometric details, (4) PIXIE with estimated face albedo and lighting.
Figure A.5: Qualitative PIXIE results. From left to right: (1) RGB image, (2) PIXIE, (3) PIXIE with facial geometric details, (4) PIXIE with estimated face albedo and lighting.
Figure A.6: Failure cases for PIXIE. In these examples, the implicit reasoning about gender and the face shape information are not enough to correctly infer the body shape. Furthermore, due to the formulation of the photometric term, the model prefers to explain image evidence using lighting rather than albedo, which leads to wrong skin tone predictions. Finally, replacing the weak-perspective camera with a perspective model would make the model more robust to extreme viewing angles and perspective distortion effects. Future work should look into denser forms of supervision, formulating a better photometric term and integrating a perspective camera to resolve these issues.
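The camera limitation mentioned in the caption of Fig. A.6 can be illustrated with a small sketch. It assumes the common weak-perspective convention of a single global scale plus a 2D translation, and a pinhole model with a focal length and the principal point at the origin. The function names and the example values are illustrative only and are not taken from the PIXIE code.

```python
from typing import Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def weak_perspective(point: Point3D, scale: float, tx: float, ty: float) -> Point2D:
    """Project with one global scale; the per-point depth z is ignored."""
    x, y, _ = point
    return (scale * (x + tx), scale * (y + ty))


def perspective(point: Point3D, focal_length: float) -> Point2D:
    """Pinhole projection; each point is divided by its own depth z."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)


# Two points with the same lateral offset but different depths,
# e.g. a hand stretched towards the camera versus the torso behind it.
near: Point3D = (0.5, 0.0, 2.0)
far: Point3D = (0.5, 0.0, 4.0)

weak_near = weak_perspective(near, scale=1.0, tx=0.0, ty=0.0)
weak_far = weak_perspective(far, scale=1.0, tx=0.0, ty=0.0)
persp_near = perspective(near, focal_length=2.0)
persp_far = perspective(far, focal_length=2.0)
```

Under the weak-perspective model both points land on the same pixel, whereas the pinhole model separates them according to depth. This foreshortening is what a weak-perspective camera cannot reproduce when a person is close to the camera or seen from an extreme angle.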