HuMoR: 3D Human Motion Model for Robust Pose Estimation

by Davis Rempe et al.

We introduce HuMoR: a 3D Human Motion Model for Robust Estimation of temporal pose and shape. Though substantial progress has been made in estimating 3D human motion and shape from dynamic observations, recovering plausible pose sequences in the presence of noise and occlusions remains a challenge. For this purpose, we propose an expressive generative model in the form of a conditional variational autoencoder, which learns a distribution of the change in pose at each step of a motion sequence. Furthermore, we introduce a flexible optimization-based approach that leverages HuMoR as a motion prior to robustly estimate plausible pose and shape from ambiguous observations. Through extensive evaluations, we demonstrate that our model generalizes to diverse motions and body shapes after training on a large motion capture dataset, and enables motion reconstruction from multiple input modalities including 3D keypoints and RGB(-D) videos.








1 Introduction

As humans, we are constantly moving in, interacting with, and manipulating the world around us. Thus, applications such as action recognition [95, 96] or holistic dynamic indoor scene understanding [19] require accurate perception of 3D human pose, shape, motion, contacts, and interaction. Extensive previous work has focused on estimating 2D or 3D human pose [15, 63, 64], shape [70, 31, 80], and motion [46] from videos. These are challenging problems due to the large space of articulations, body shapes, and appearance variations. Even the best methods struggle to accurately capture a wide variety of motions from varying input modalities, produce noisy or overly smoothed motions (especially at ground contact, e.g., footskate), and struggle in the presence of significant occlusions (e.g., walking behind a couch as in Fig. 1).

We focus on the problem of building a robust human motion model that can address these challenges. To date, most motion models directly represent sequences of likely poses — e.g., in PCA space [68, 93, 83] or via future-predicting autoregressive processes [88, 92, 74]. However, purely pose-based predictions either make modeling environment interactions and generalization beyond training poses difficult, or quickly diverge from the space of realistic motions. On the other hand, explicit physical dynamics models [76, 51, 82, 75, 14, 13] are resource intensive and require knowledge of unobservable physical quantities. While generative models potentially offer the required flexibility, building an expressive, generalizable, and robust model for realistic 3D human motions remains an open problem.

To address this, we introduce a learned, autoregressive, generative model that captures the dynamics of 3D human motion, i.e., how pose changes over time. Rather than describing likely poses, the Human Motion Model for Robust Estimation (HuMoR) models a probability distribution of possible pose transitions, formulated as a conditional variational autoencoder [85]. Though not explicitly physics-based, its components correspond to a physical model: the latent space can be interpreted as generalized forces, which are inputs to a dynamics model with numerical integration (the decoder). Moreover, ground contacts are explicitly predicted and used to constrain pose estimation at test time.

After training on the large AMASS motion capture dataset [61], we use HuMoR as a motion prior at test time for 3D human perception from noisy and partial observations across different input modalities such as RGB(-D) video and 2D or 3D joint sequences, as illustrated in Fig. 1 (left). In particular, we introduce a robust test-time optimization (TestOpt) strategy which interacts with HuMoR to estimate the parameters of 3D motion, body shape, the ground plane, and contact points, as shown in Fig. 1 (middle/right). This interaction happens in two ways: (i) by parameterizing the motion in the latent space of HuMoR in addition to the physical space of ground/contact and shape, and (ii) by using HuMoR priors to regularize the optimization towards the space of plausible motions. Together, these allow our model to be seamlessly integrated into TestOpt, leading to a robust temporal pose estimation framework.

Comprehensive evaluations reveal that our method surpasses the state-of-the-art on a variety of visual inputs in terms of accuracy and physical plausibility of motions under partial and severe occlusions. We further demonstrate that our motion model generalizes to diverse motions and body shapes on common generative tasks like sampling and future prediction. In a nutshell, our contributions are:


  • HuMoR, a generative 3D human motion prior modeled by a novel conditional VAE which enables expressive and general motion reconstruction and generation,

  • A subsequent robust test-time optimization approach that uses HuMoR as a strong motion prior jointly solving for pose, body shape, and ground plane / contacts,

  • The capability to operate on a variety of inputs, such as RGB(-D) video and 2D/3D joint position sequences, to yield accurate and plausible motions and contacts, exemplified through extensive evaluations.

Our work, more generally, suggests that neural nets for dynamics problems can benefit from architectures that model transitions, allowing control structures that emulate classical physical formulations.

2 Related Work

Much progress has been made on building methods to recover 3D joint locations [73, 64, 63] or parameterized 3D pose and shape (e.g., SMPL [57]) from observations [94]. We focus primarily on motion and shape estimation.

Learning-Based Estimation

Deep learning approaches have shown success in regressing 3D shape and pose from a single image [47, 41, 71, 30, 29, 103, 20]. This has led to developments in predicting motion (pose sequences) and shape directly from RGB video [42, 105, 81, 87, 22]. Most recently, VIBE [46] uses adversarial training to encourage plausible outputs from a conditional recurrent motion generator. MEVA [60] maps a fixed-length image sequence to the latent space of a pre-trained motion autoencoder. These methods are fast and produce accurate root-relative joint positions for video, but motion is globally inconsistent and they struggle to generalize, e.g., under severe occlusions. Other works have addressed occlusions but only on static images [9, 106, 77, 26]. Our approach resolves difficult occlusions in video and other modalities by producing plausible and expressive motions.

Optimization-Based Estimation

One may directly optimize to more accurately fit to observations (images or 2D pose estimators [15]) using human body models [24, 6, 10]. SMPLify [10] uses the SMPL model [57] to fit pose and shape parameters to 2D keypoints in an image using priors on pose and shape. Later works consider body silhouettes [49] and use a learned variational pose prior [70]. Optimization for motion sequences has been explored by several works [5, 40, 55, 104, 100] which apply simple smoothness priors over time. These produce reasonable estimates when the person is fully visible, but with unrealistic dynamics, e.g., overly smooth motions and footskate.

Some works employ human-environment interaction and contact constraints to improve shape and pose estimation [33, 55, 34] by assuming scene geometry is given. iMapper  [66] recovers both 3D joints and a primitive scene representation from RGB video based on interactions by motion retrieval, which may differ from observations. In contrast, our approach optimizes for pose and shape by using an expressive generative model that produces more natural motions than prior work with realistic ground contact.

Human Motion Models

Early sophisticated motion models for pose tracking used a variety of approaches, including mixtures-of-Gaussians [38], linear embeddings of periodic motion [68, 93, 83], nonlinear embeddings [23], and nonlinear autoregressive models [88, 97, 92, 74]. These methods operate in pose space, and are limited to specific motions. Models based on physics can potentially generalize more accurately [76, 51, 82, 75, 14, 13], while also estimating global pose and environmental interactions. However, general-purpose physics-based models are difficult to learn, computationally intensive at test-time, and often assume full-body visibility to detect contacts [76, 51, 82].

Many motion models have been learned for computer animation [12, 48, 79, 50, 54, 37, 86] including recent recurrent and autoregressive models [32, 28, 35, 101, 52]. These often focus on visual fidelity for a small set of characters and periodic locomotions. Works have begun exploring generating more general motion and body shapes [107, 72, 3, 21], but in the context of short-term future prediction. HuMoR is most similar to Motion VAE [52], however we make crucial contributions to enable generalization to unseen, non-periodic motions on novel body shapes.

Figure 2: HuMoR CVAE Architecture. During training, given the previous state x_{t-1} and ground truth current state x_t, the model reconstructs x_t by sampling from the encoder distribution. At test time we can (i) generate the next state x_t from x_{t-1} by sampling from the prior distribution and decoding, (ii) infer a latent transition z_t with the encoder, or (iii) evaluate the likelihood of a given z_t with the conditional prior.

3 HuMoR: 3D Human Dynamics Model

The goal of our work is to build an expressive and generalizable generative model of 3D human motion learned from real human motions, and to show that this can be used for robust test-time optimization (TestOpt) of pose and shape. In this section, we first describe the model, HuMoR.

State Representation

We represent the state of a moving person as a matrix composed of the root translation r, root orientation Φ in axis-angle form, body pose joint angles Θ, and joint positions J:

x = [ r  ṙ  Φ  Φ̇  Θ  J  J̇ ]

where ṙ, Φ̇, and J̇ denote the root (linear and angular) and joint velocities, respectively. Part of the state, (r, Φ, Θ), parameterizes the SMPL+H body model [78], a differentiable function M(r, Φ, Θ, β) that maps to body mesh vertices and joints given shape parameters β. Our over-parameterization allows for two ways to recover the joints: (i) explicitly from J, or (ii) implicitly through the SMPL+H map M.
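To make the over-parameterized state concrete, the sketch below packs and unpacks such a state vector in numpy. The component ordering and joint counts (21 joint-angle triplets and 22 joint positions, loosely following the SMPL+H skeleton) are illustrative assumptions, not the exact layout of the released model.

```python
import numpy as np

# Illustrative joint counts assumed here (roughly the SMPL+H skeleton).
NUM_ANGLES, NUM_JOINTS = 21, 22

def pack_state(r, r_vel, phi, phi_vel, theta, joints, joint_vel):
    """Concatenate x = [r, r', Phi, Phi', Theta, J, J'] into one flat vector."""
    return np.concatenate([r, r_vel, phi, phi_vel, theta.reshape(-1),
                           joints.reshape(-1), joint_vel.reshape(-1)])

def unpack_state(x):
    """Recover the individual components from a flat state vector."""
    i = 0
    def take(n, shape=None):
        nonlocal i
        out = x[i:i + n]
        i += n
        return out if shape is None else out.reshape(shape)
    return (take(3), take(3), take(3), take(3),
            take(3 * NUM_ANGLES, (NUM_ANGLES, 3)),
            take(3 * NUM_JOINTS, (NUM_JOINTS, 3)),
            take(3 * NUM_JOINTS, (NUM_JOINTS, 3)))

STATE_DIM = 4 * 3 + 3 * NUM_ANGLES + 2 * 3 * NUM_JOINTS  # 207 with these counts
```

Keeping both the explicit joints J and the SMPL+H parameters in the state is what allows the two recovery paths described above.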

Latent Variable Dynamics Model

We are interested in modeling the probability of a time sequence of states

p_θ(x_0, x_1, …, x_T) = p(x_0) ∏_{t=1}^{T} p_θ(x_t | x_{t-1}),

where each state is assumed to depend only on the previous one and θ are learned parameters. Then p_θ(x_t | x_{t-1}) must capture the plausibility of a transition.

We propose a conditional variational autoencoder (CVAE) which formulates the motion as a latent variable model, as shown in Fig. 2. This transition model has a nice physical interpretation, which we detail later, and similar models have shown encouraging results for animation [52]. Following the original CVAE derivation [85], our model contains two main components. First, conditioned on the previous state x_{t-1}, the distribution over possible latent variables z_t is described by a learned conditional prior:

p_θ(z_t | x_{t-1}) = N(z_t; μ_θ(x_{t-1}), σ_θ(x_{t-1})),

which parameterizes a Gaussian distribution with diagonal covariance via a neural network. Intuitively, the latent variable z_t represents the transition to x_t and should therefore have different distributions given different x_{t-1}. For example, an idle person has a large variation of possible next states, while a person in midair is on a nearly deterministic trajectory. Learning the conditional prior significantly improves the ability of the CVAE to generalize to diverse motions and empirically stabilizes both training and TestOpt.

Second, conditioned on z_t and x_{t-1}, the decoder produces two outputs: a change in state Δ_θ(z_t, x_{t-1}) and person-ground contacts c_t. The change in state defines the output distribution through

p_θ(x_t | z_t, x_{t-1}) = N(x_t; x_{t-1} + Δ_θ(z_t, x_{t-1}), I).

We find the additive update improves predictive accuracy compared to direct next-step prediction. The person-ground contact c_t gives the probability that each of 8 body joints (left and right toes, heels, knees, and hands) is in contact with the ground at time t. Contacts are not part of the input to the conditional prior, only an output of the decoder. The contacts enable environmental constraints in TestOpt, and also encourage more physics-aware learning.

The complete probability model for a transition is then:

p_θ(x_t | x_{t-1}) = ∫ p_θ(x_t | z_t, x_{t-1}) p_θ(z_t | x_{t-1}) dz_t.

Given an initial state x_0, one can sample a motion sequence by alternating between sampling z_t ∼ p_θ(z_t | x_{t-1}) and sampling x_t ∼ p_θ(x_t | z_t, x_{t-1}), from t = 1 to T. This model parallels a conventional stochastic physical model. The conditional prior can be seen as a controller, producing "forces" z_t as a function of state x_{t-1}, while the decoder acts like a combined physical dynamics model and Euler integrator of generalized position and velocity in the output distribution above.
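The alternating sample-then-decode procedure can be sketched as follows, with tiny random linear maps standing in for the learned prior and decoder networks (the sizes and weights here are toy assumptions, not the trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, LATENT_DIM = 12, 4  # toy sizes, far smaller than the real model

# Random linear maps stand in for the learned prior and decoder networks.
W_mu = rng.normal(scale=0.1, size=(LATENT_DIM, STATE_DIM))
W_logsig = rng.normal(scale=0.1, size=(LATENT_DIM, STATE_DIM))
W_dec = rng.normal(scale=0.1, size=(STATE_DIM, LATENT_DIM + STATE_DIM))

def conditional_prior(x_prev):
    """p(z_t | x_{t-1}): a diagonal Gaussian whose parameters depend on x_{t-1}."""
    mu = W_mu @ x_prev
    sigma = np.exp(W_logsig @ x_prev)  # exponentiate to keep sigma positive
    return mu, sigma

def decode(z, x_prev):
    """The decoder predicts a change in state; the new state is x_{t-1} + delta."""
    delta = W_dec @ np.concatenate([z, x_prev])
    return x_prev + delta

def sample_rollout(x0, T):
    """Alternate sampling z_t from the conditional prior and decoding."""
    states = [x0]
    for _ in range(T):
        mu, sigma = conditional_prior(states[-1])
        z = mu + sigma * rng.standard_normal(LATENT_DIM)
        states.append(decode(z, states[-1]))
    return np.stack(states)  # (T + 1, STATE_DIM)
```

Fixing the z_t instead of sampling them turns the same loop into the deterministic rollout used later during test-time optimization.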

Our model has connections to Motion VAE (MVAE) [52], which has recently shown promising results for single-character locomotion by also using a VAE over transitions; however, we find that directly applying it for estimation does not give good results (Sec. 5). We overcome this by learning a conditional prior, modeling the change in state and contacts, and ensuring consistency between the state and body model (Sec. 3.1).

Rollout Function

We use our model to define a deterministic rollout function, which is key to TestOpt. Given an initial state x_0 and a sequence of latent transitions z_1, …, z_T, we define a function f(x_0, z_1:T) that deterministically maps the motion "parameters" (x_0, z_1:T) to the resulting state x_T at time T. This is done through autoregressive rollout, which decodes and integrates the most likely next state at each timestep.

Initial State GMM

We model p(x_0) with a Gaussian mixture model (GMM) containing K components with weights w_i, so that p(x_0) = Σ_{i=1}^{K} w_i N(x_0; μ_i, Σ_i).
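A minimal sketch of such an initial-state mixture, with toy component parameters assumed purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-component GMM over a 3-D "initial state" (illustrative sizes only).
weights = np.array([0.3, 0.7])
means = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
stds = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2]])

def sample_initial_state():
    """Pick a mixture component by its weight, then sample its Gaussian."""
    k = rng.choice(len(weights), p=weights)
    return means[k] + stds[k] * rng.standard_normal(3)

def gmm_log_likelihood(x):
    """log p(x0) under the mixture; usable as an initial-state prior at test time."""
    comp = [
        np.log(w) - 0.5 * np.sum(((x - m) / s) ** 2 + np.log(2 * np.pi * s ** 2))
        for w, m, s in zip(weights, means, stds)
    ]
    return np.logaddexp.reduce(comp)
```

The log-sum-exp reduction keeps the mixture likelihood numerically stable, which matters when this term later appears inside an optimization objective.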

3.1 Training

Our CVAE is trained using pairs of states (x_{t-1}, x_t). We consider the usual variational lower bound:

log p_θ(x_t | x_{t-1}) ≥ E_{q_φ}[ log p_θ(x_t | z_t, x_{t-1}) ] − KL( q_φ(z_t | x_t, x_{t-1}) ‖ p_θ(z_t | x_{t-1}) ).

The expectation term measures the reconstruction error of the decoder. The encoder q_φ(z_t | x_t, x_{t-1}), an approximate posterior, is introduced for training and parameterizes a Gaussian distribution with diagonal covariance. The KL divergence regularizes its output to be near the conditional prior. Therefore, we seek the parameters θ, φ that minimize the loss function

L = −w_ELBO L_ELBO + L_reg

over all training pairs in our dataset, where L_ELBO is the lower bound above with weight w_ELBO, and L_reg contains additional regularizers.

For a single training pair (x_{t-1}, x_t), the reconstruction loss is computed as the error between x_t and the decoder output, with z_t sampled from the encoder q_φ(z_t | x_t, x_{t-1}). Gradients are backpropagated through this sample using the reparameterization trick [44]. The regularization loss contains two terms: L_reg = L_SMPL + L_contact. The SMPL term uses the output of the body model with the estimated parameters and ground truth shape β:

L_SMPL = ‖ Ĵ_t − M(r̂_t, Φ̂_t, Θ̂_t, β) ‖².

The loss encourages consistency between regressed joints and those of the body model. The contact loss contains two terms. The first supervises ground contact classification with a typical binary cross entropy; the second regularizes joint velocities to be consistent with contacts, penalizing Σ_j ĉ_t^j ‖ J̇_t^j ‖² with ĉ_t^j the predicted probability that joint j is in ground contact. The weights of both terms are set empirically.
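The loss above combines a reconstruction term, a KL divergence between two diagonal Gaussians (encoder vs. conditional prior), and the contact cross entropy. A simplified numpy sketch, omitting the SMPL and velocity regularizers; the weight names are hypothetical:

```python
import numpy as np

def diag_gaussian_kl(mu_q, sig_q, mu_p, sig_p):
    """KL( N(mu_q, sig_q^2) || N(mu_p, sig_p^2) ) for diagonal Gaussians,
    i.e. the divergence between the encoder and the conditional prior."""
    return np.sum(np.log(sig_p / sig_q)
                  + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2) - 0.5)

def bce(prob, label, eps=1e-8):
    """Binary cross entropy for the per-joint contact classification."""
    prob = np.clip(prob, eps, 1 - eps)
    return -np.mean(label * np.log(prob) + (1 - label) * np.log(1 - prob))

def training_loss(x_pred, x_gt, mu_q, sig_q, mu_p, sig_p,
                  contact_prob, contact_gt, w_kl=1.0, w_contact=1.0):
    """Sketch of the total loss: reconstruction + (annealable) KL + contact BCE."""
    recon = np.mean((x_pred - x_gt) ** 2)
    return recon + w_kl * diag_gaussian_kl(mu_q, sig_q, mu_p, sig_p) \
                 + w_contact * bce(contact_prob, contact_gt)
```

Annealing w_kl from 0 upward during training corresponds to the KL-weight schedule mentioned in the implementation details below.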

The initial state GMM is trained separately with expectation-maximization on all available states in the same dataset used to train the CVAE.

Implementation Details

To ease learning and improve generalization, our model operates in an aligned canonical coordinate frame at each step. All networks are 4- or 5-layer MLPs with ReLU activations and group normalization.

A common difficulty in training VAEs is posterior collapse [59], where the learned latent encoding is effectively ignored by the decoder. This problem is exacerbated in CVAEs since the decoder receives additional conditioning [52, 85]. In addition to linearly annealing the weight on the KL term [11], we found the conditional prior important to combat collapse. Following [52], we also use scheduled sampling [8] in training to enable long-term generation by making the model robust to its own errors. Additional implementation details are available in the supplementary material.

4 Joint Optimization of Motion and Shape

We next use the space of motion learned by HuMoR as a prior in TestOpt to recover pose and shape from noisy and partial observations while ensuring plausibility.

4.1 Optimization Variables

Given a sequence of observations y_0, …, y_T, either as 2D/3D joints, a 3D point cloud, or 3D keypoints, we seek the shape β and a sequence of SMPL pose parameters which describe the underlying motion being observed. We parameterize the optimized motion using our CVAE by the initial state x_0 and a sequence of latent transitions z_1, …, z_T. The state x_t at time t (and at any intermediate steps) is then determined through model rollout using the decoder as previously detailed. Compared to directly optimizing SMPL parameters [5, 10, 40], this motion representation naturally encourages plausibility and is compact in the number of variables. To obtain the transformation between the canonical coordinate frame in which our CVAE is trained and the observation frame used for optimization, we additionally optimize the ground plane g of the scene. Altogether, we simultaneously optimize the initial state x_0, a sequence of latent variables z_1:T, ground g, and shape β. We assume a static camera whose intrinsics are known.

4.2 Objective & Optimization

Our optimization objective can be formulated as a maximum a-posteriori (MAP) estimate (see supplementary for the full derivation), resulting in the following objective, which seeks a motion that is plausible under our generative model while closely matching observations:

min_{x_0, z_1:T, g, β}  E_mot + E_data + E_reg.

We next detail each of these terms, which are the motion prior, data, and regularization energies. In the following, λ's denote weights that determine the contribution of each term.

Motion Prior

This energy measures the likelihood of the latent transitions and initial state under the HuMoR CVAE and GMM. It is E_mot = E_CVAE + E_init, where

E_CVAE = −Σ_{t=1}^{T} log p_θ(z_t | x_{t-1}),   E_init = −log p(x_0);

E_CVAE uses the learned conditional prior and E_init uses the initial state GMM.
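Since the conditional prior is a diagonal Gaussian, each per-step energy is a standard Gaussian negative log-likelihood. A small sketch (function names are illustrative):

```python
import numpy as np

def gaussian_nll(z, mu, sigma):
    """-log N(z; mu, diag(sigma^2)): the per-step conditional-prior energy."""
    return 0.5 * np.sum(((z - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2))

def motion_prior_energy(zs, mus, sigmas, init_energy):
    """E_mot: sum of -log p(z_t | x_{t-1}) over the sequence plus the
    initial-state term -log p(x_0), passed in precomputed (e.g. from a GMM)."""
    return sum(gaussian_nll(z, m, s) for z, m, s in zip(zs, mus, sigmas)) + init_energy
```

Latent transitions far from the prior mean are thus penalized quadratically, pulling the optimized motion back toward the space the CVAE considers plausible.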

Data Term

This term is the only modality-dependent component of our approach, requiring different losses for different inputs: 3D joints, 2D joints, and 3D point clouds. We next specify the individual terms for each modality. All operate on the SMPL joints or mesh vertices obtained through the body model using the current SMPL parameters, which are contained in x_t, and shape β. In the simplest case, the observations are 3D joint positions (or keypoints with known correspondences) J̄_t and our energy is

E_data-3D = Σ_{t=0}^{T} Σ_j ‖ J_t^j − J̄_t^j ‖²,

with J_t^j the body-model joints. For 2D joint positions p_t^j, each with a detection confidence σ_t^j, we use a re-projection loss

E_data-2D = Σ_{t=0}^{T} Σ_j σ_t^j ρ( Π(J_t^j) − p_t^j ),

with ρ the robust Geman-McClure function [10, 25] and Π the pinhole projection. If an estimated person segmentation mask is available, we use this to ignore spurious 2D joints. Finally, if the observation is a 3D point cloud P_t obtained from a depth map roughly masked around the person of interest, we use the mesh vertices V_t to compute

E_data-pc = Σ_{t=0}^{T} Σ_{p ∈ P_t} w_p min_{v ∈ V_t} ‖ v − p ‖²,

where w_p is a robust bisquare weight [7] computed based on the Chamfer distance term.
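As an illustration of the 2D data term, the following sketch combines a pinhole projection with a Geman-McClure robust penalty; the scale constant c and the exact normalization are assumptions for illustration:

```python
import numpy as np

def geman_mcclure(residual, c=1.0):
    """Robust Geman-McClure penalty: grows like r^2 for small residuals
    but saturates for outliers, downweighting bad 2D detections."""
    r2 = np.sum(residual ** 2, axis=-1)
    return r2 / (r2 + c ** 2)

def pinhole_project(points_3d, fx, fy, cx, cy):
    """Project camera-frame 3D points to pixel coordinates (pinhole model)."""
    x, y, z = points_3d[..., 0], points_3d[..., 1], points_3d[..., 2]
    return np.stack([fx * x / z + cx, fy * y / z + cy], axis=-1)

def reprojection_energy(joints_3d, joints_2d, conf, fx, fy, cx, cy):
    """Confidence-weighted robust reprojection error over all joints."""
    resid = pinhole_project(joints_3d, fx, fy, cx, cy) - joints_2d
    return np.sum(conf * geman_mcclure(resid))
```

The saturation of the robust penalty is what keeps a handful of bad 2D detections from dominating the objective over a full sequence.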


The additional regularization consists of four terms: E_reg = E_skel + E_env + E_gnd + E_shape. The first two terms encourage rolled-out motions from the CVAE to be plausible even when the initial state is far from the optimum (early in optimization). The skeleton consistency term E_skel uses the joints J_t directly predicted by the decoder during rollout along with the SMPL joints:

E_skel = Σ_{t=0}^{T} ( ‖ J_t − M(r_t, Φ_t, Θ_t, β) ‖² + Σ_e ( ℓ_t^e − ℓ̄^e )² ),

where the second summation uses bone lengths ℓ_t^e computed from J_t at each step. The second regularizer E_env ensures consistency between predicted CVAE contacts, the motion, and the environment:

E_env = Σ_{t} Σ_j ĉ_t^j ( ‖ J̇_t^j ‖² + λ_h ( z_t^j )² ),

where ĉ_t^j is the contact probability output from the model for joint j. The contact height term weighted by λ_h ensures the up-component z_t^j of contacting joints stays within a small offset of the floor when in contact (since joints are inside the body) in the canonical frame.

The final two regularizers are priors on the ground and shape: E_gnd assumes the ground should stay close to its initialization, and E_shape keeps β near the neutral zero vector, similar to [33, 70].

Initialization & Optimization

We initialize the temporal SMPL parameters and shape with a two-stage optimization using E_data and E_shape along with two additional regularization terms. The first is a pose prior ‖ z_pose ‖², where z_pose is the body joint angles represented in the latent space of the VPoser model [70, 33]. The second is a smoothness term that smooths 3D joint positions over time. In this initialization, the global translation and orientation are optimized first, followed by full pose and shape. Finally, the initial latent sequence z_1:T is computed through inference with the learned CVAE encoder q_φ. Our optimization is implemented in PyTorch [69] using L-BFGS and autograd; with batching, an RGB video takes about 5.5 minutes to fit. We provide further details in the supplementary material.

5 Experimental Results

                       Future Prediction       Diversity
Model           Contact   ADE     FDE            APD
MVAE [52]          -      25.8    50.6           85.4
HuMoR             0.88    21.5    42.1           94.9
HuMoR (Qual)      0.88    22.0    46.3          100.0
Table 1: (Left) Future prediction accuracy for AMASS sequences. Contact classification accuracy, average displacement error (ADE), and final displacement error (FDE) are reported. (Right) Sampling diversity over rollouts, measured by average pairwise distance (APD).

We evaluate HuMoR on (i) generative sampling tasks and (ii) as a prior in TestOpt to estimate motion from 3D and RGB(-D) inputs. We encourage viewing the supplementary video to appreciate the qualitative improvement of our approach. Additional dataset and experiment details are available in the supplementary document.

5.1 Datasets

AMASS [61] is a large motion capture database containing diverse motions and body shapes on the SMPL body model. We sub-sample the dataset to 30 Hz and use the recommended training split to train the CVAE and initial state GMM in HuMoR. We evaluate on the held out Transitions and HumanEva [84] subsets (Sec. 5.3 and 5.4).

i3DB [66] contains RGB videos of person-scene interactions involving medium to heavy occlusions. It provides annotated 3D joint positions and a primitive 3D scene reconstruction which we use to fit a ground plane for computing plausibility metrics. We run off-the-shelf 2D pose estimation [15], person segmentation [17], and plane detection [53] models to obtain inputs for our optimization.

PROX [33] contains RGB-D videos of people interacting with indoor environments. We use a subset of the qualitative data to evaluate plausibility metrics using a floor plane fit to the provided ground truth scene mesh. We obtain 2D pose, person masks, and ground plane initialization in the same way as done for i3DB.

5.2 Baselines and Evaluation Metrics

Motion Prior Baselines

We ablate the proposed CVAE to analyze its core components: No Delta directly predicts the next state from the decoder rather than the change in state, No Contacts does not classify ground contacts, No SMPL does not use SMPL regularization in training, and Standard Prior uses a standard normal prior N(0, I) rather than our learned conditional prior. Ablating all of these together recovers MVAE [52].

Motion Estimation Baselines

VPoser-t is the initialization phase of our optimization. It uses VPoser [70] and 3D joint smoothing similar to previous works [5, 40, 104]. PROX-(RGB/D) [33] are optimization-based methods which operate on individual frames of RGB and RGB-D videos, respectively. Both assume the full scene mesh is given to enforce contact and penetration constraints. VIBE [46] is a recent learned method to recover shape and pose from video.

Error Metrics

3D positional errors are measured on joints, keypoints, or mesh vertices (Vtx); unless otherwise specified, we compute the global mean per-point position error. We report positional errors for all (All), occluded (Occ), and visible (Vis) observations separately. Finally, we report the binary classification accuracy of the 8 person-ground contacts (Contact) predicted by HuMoR.

Plausibility Metrics

We use additional metrics to measure qualitative motion characteristics that joint errors cannot capture. Smoothness is evaluated by mean per-joint accelerations (Accel) [42]. Another important indicator of plausibility is ground penetration [76]. We use the true ground plane to compute the frequency (Freq) of foot-floor penetrations: the fraction of frames, for both the left and right toe joints, that penetrate more than a threshold. We measure frequency at thresholds of 0, 3, 6, 9, 12, and 15 cm and report the mean. We also report the mean penetration distance (Dist), where non-penetrating frames contribute a distance of 0 to make values comparable across differing frequencies.
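These penetration metrics can be sketched as follows for a single toe joint, with heights in metres relative to the ground plane (the exact threshold list is an assumption continuing the 3 cm spacing of the paper's values):

```python
import numpy as np

def penetration_metrics(toe_heights, thresholds=(0.0, 0.03, 0.06, 0.09, 0.12, 0.15)):
    """Ground-penetration frequency and mean distance for toe-joint heights
    (metres, ground plane at height 0); frequency is averaged over thresholds."""
    toe_heights = np.asarray(toe_heights)
    # Fraction of frames penetrating deeper than each threshold, then averaged.
    freqs = [np.mean(toe_heights < -thr) for thr in thresholds]
    # Non-penetrating frames contribute 0 to the mean penetration distance.
    pen = np.clip(-toe_heights, 0.0, None)
    return float(np.mean(freqs)), float(np.mean(pen))
```

Averaging over several thresholds makes the frequency metric less sensitive to small calibration errors in the estimated ground plane.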

                              ───── Positional Error ─────  Mesh
Method        Input            Vis    Occ    All    Legs    Vtx   Contact  Accel  Freq    Dist
VPoser-t      Occ Keypoints   0.67  20.76   9.22   21.08   7.95     -      5.71  16.77%   2.28
MVAE [52]     Occ Keypoints   2.39  19.15   9.52   16.86   8.90     -      7.12   3.15%   0.30
HuMoR (Ours)  Occ Keypoints   1.46  17.40   8.24   15.42   7.56   0.89     5.38   3.31%   0.26
VPoser-t      Noisy Joints      -      -    3.67    4.47   4.98     -      4.61   1.35%   0.07
MVAE [52]     Noisy Joints      -      -    2.68    3.21   4.42     -      6.50   1.75%   0.11
HuMoR (Ours)  Noisy Joints      -      -    2.27    2.61   3.55   0.97     5.23   1.18%   0.05
Table 2: Motion and shape estimation from 3D observations: partially occluded keypoints (top) and noisy joints (bottom). Positional error (cm) is reported w.r.t. the input modality; acceleration is in m/s² and penetration distance (Dist) in cm.

5.3 Generative Model Evaluation

We first evaluate HuMoR as a standalone generative model and show improved generalization to unseen motions and bodies compared to MVAE for two common tasks (see Table 1): future prediction and diverse sampling. We use AMASS sequences and start generation from the first step. Results are shown for HuMoR and a modified HuMoR (Qual) that uses the joints computed from the SMPL body model as input to each step during rollout instead of the regressed joints, thereby enforcing skeleton consistency. This version produces qualitatively superior results for generation, but is too expensive to use during TestOpt.

For prediction, we report average displacement error (ADE) and final displacement error (FDE) [102], which measure mean joint errors over all steps and at the final step, respectively. We sample 50 motions for each initial state, and the one with the lowest ADE is considered the prediction. For diversity, we sample 50 motions and compute the average pairwise distance (APD) [4], the mean joint distance between all pairs of samples.
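A compact sketch of these three metrics over a set of sampled trajectories (the array shapes are assumptions for illustration):

```python
import numpy as np

def ade_fde(samples, gt):
    """samples: (S, T, J, 3) predicted joint trajectories; gt: (T, J, 3).
    ADE/FDE are taken for the best sample (lowest ADE), as described above."""
    err = np.linalg.norm(samples - gt[None], axis=-1)   # (S, T, J)
    per_sample_ade = err.mean(axis=(1, 2))
    best = int(np.argmin(per_sample_ade))
    fde = err[best, -1].mean()                          # final-step error
    return float(per_sample_ade[best]), float(fde)

def apd(samples):
    """Average pairwise joint distance between all pairs of samples."""
    S = samples.shape[0]
    dists = [np.linalg.norm(samples[i] - samples[j], axis=-1).mean()
             for i in range(S) for j in range(i + 1, S)]
    return float(np.mean(dists))
```

Taking the best-of-S sample for ADE/FDE is the standard way to score stochastic predictors, since a diverse model should not be penalized for also covering other plausible futures.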

As seen in Tab. 1, the base MVAE [52] does not generalize well when trained on the large AMASS dataset; our proposed CVAE improves both the accuracy and diversity of samples. HuMoR (Qual) hinders prediction accuracy, but gives better diversity and visual quality (see supplement).

Figure 3: Fitting to partial 3D keypoints. HuMoR captures non-periodic motions like jumping, crouching, and kicking.

5.4 Estimation from 3D Observations

Next, we show that HuMoR also generalizes better when used in TestOpt for fitting to 3D data, and that using a motion prior is crucial to plausibly handling occlusions. AMASS sequences are used to demonstrate two key abilities: (i) fitting to partial data and (ii) denoising. For the former, TestOpt fits to 43 keypoints on the body that resemble motion capture markers; keypoints that fall below a height threshold at each timestep are "occluded", leaving the legs unobservable at most steps. For denoising, Gaussian noise with a fixed standard deviation is added to the 3D joint position observations.

Tab. 2 compares to VPoser-t and to using MVAE as the motion prior during optimization rather than HuMoR. We report leg joint errors (toes, ankles, and knees), which are often occluded, separately. The right side of the table reports plausibility metrics. HuMoR gives more accurate poses, especially for occluded keypoints and leg joints. It also estimates smoother motions with fewer and less severe ground penetrations. For denoising, VPoser-t oversmooths which gives the lowest acceleration but least accurate motion. TestOpt with HuMoR gives inherently smooth results while still allowing for necessarily large accelerations to fit dynamic observations. Notably, HuMoR predicts person-ground contact with 97% accuracy even under severe noise. Qualitative results are shown in Fig. 1 and Fig. 3.

Figure 4: From RGB-D (top) TestOpt with HuMoR outputs 3D motion, the ground plane, and contacts (bottom).
                ──── Global Joint Error ────    ── Root-Aligned Joint Error ──     Ground Pen
Method           Vis     Occ     All    Legs     Vis    Occ    All    Legs   Accel   Freq   Dist
VIBE [46]       90.05  192.55  116.46  121.61   12.06  23.78  15.08  21.65  243.36  7.98%  3.01
VPoser-t        28.33   40.97   31.59   35.06   12.77  26.48  16.31  25.60    4.46  9.28%  2.42
MVAE [52]       37.54   50.63   40.91   44.42   16.00  28.32  19.17  26.63    4.96  7.43%  1.55
No Delta        27.55   35.59   29.62   32.14   11.92  23.10  14.80  21.65    3.05  2.84%  0.58
No Contacts     26.65   39.21   29.89   35.73   12.24  23.36  15.11  22.25    2.43  5.59%  1.70
No SMPL         31.09   43.67   34.33   36.84   12.81  25.47  16.07  23.54    3.21  4.12%  1.31
Standard Prior  77.60  146.76   95.42   99.01   18.67  39.40  24.01  34.02    5.98  8.30%  6.47
HuMoR (Ours)    26.00   34.36   28.15   31.26   12.02  21.70  14.51  20.74    2.43  2.12%  0.68
Table 3: Motion and shape from RGB video (2D joints) on i3DB [66]. Joint errors are in cm and acceleration is in m/s². The top shows results from motion estimation baselines, while the bottom uses ablations of HuMoR during optimization.
Figure 5: Qualitative comparison for fitting to RGB video (2D joints) from i3DB [66]. Optimization using HuMoR (Ours) outputs natural and plausible sitting and walking motions under heavy occlusions compared to baseline approaches.
                        ──── Ground Pen ────
Method         Input    Accel   Freq    Dist
VIBE [46]      RGB      86.06  23.46%   4.71
PROX-RGB [33]  RGB     196.07   2.55%   0.32
VPoser-t       RGB       3.14  13.38%   2.82
HuMoR (Ours)   RGB       1.73   9.99%   1.56
PROX-D [33]    RGB-D    46.59   8.95%   1.19
VPoser-t       RGB-D     3.27  10.66%   2.18
HuMoR (Ours)   RGB-D     1.61   5.19%   0.85
Table 4: Plausibility evaluation on videos in PROX [33]. Acceleration is in m/s² and penetration distance in cm.

5.5 Estimation from RGB(-D) Observations

Finally, we show that TestOpt with HuMoR can be applied to real-world RGB and RGB-D observations, and outperforms baselines on positional and plausibility metrics, especially on partial and noisy data. We use 90-frame (3 s) clips from i3DB [66] and PROX [33]. Tab. 3 shows results on i3DB, which affords quantitative 3D joint evaluation. The top half compares to baseline estimation methods; the bottom uses ablations of HuMoR in TestOpt rather than the full model. Mean per-joint position errors are reported for global joint positions and after root alignment.

As seen in Tab. 3, VIBE gives locally accurate predictions for visible joints, but large global errors and unrealistic accelerations due to occlusions and temporal inconsistency (see Fig. 5). VPoser-t gives reasonable global errors, but suffers frequent penetrations, as shown for sitting in Fig. 5. Using MVAE or ablations of HuMoR as the motion prior in TestOpt fails to effectively generalize to real-world data and performs worse than the full model. The conditional prior and the SMPL regularization have the largest impact, while even the ablation without contacts still outperforms the baselines.

The top half of Tab. 4 evaluates plausibility on additional RGB results from PROX compared to VIBE and PROX-RGB. Since PROX-RGB uses the scene mesh as input to enforce environment constraints, it is a very strong baseline and its performance on penetration metrics is expectedly good. HuMoR comparatively increases penetration frequency since it only gets a rough ground plane as initialization, but gives much smoother motions.

The bottom half of Tab. 4 shows results fitting to RGB-D for the same PROX data, which uses both the 2D joint and 3D point cloud energy terms in TestOpt. This further improves HuMoR's performance, which slightly outperforms PROX-D; PROX-D is less robust to erroneous 2D joint detections and 3D point noise, which cause large errors. Qualitative examples are in Fig. 1 and Fig. 4.

6 Discussion

We have introduced HuMoR, a learned generative model of 3D human motion leveraged during test-time optimization to robustly recover pose and shape from 3D, RGB, and RGB-D observations. We have demonstrated that the key components of our model enable generalization to novel motions and body shapes for both generative tasks and downstream optimization. Compared to strong learning and optimization-based baselines, HuMoR excels at estimating plausible motion under heavy occlusions, and simultaneously produces consistent ground plane and contact outputs.

Limitations & Future Work

HuMoR leaves ample room for future studies. The static camera and ground plane assumptions are reasonable for indoor scenes but true in-the-wild operation demands methods handling dynamic cameras and complex terrain. Our rather simplistic contact model should be upgraded to capture scene-person interactions for improved motion and scene perception. Lastly, we plan to learn motion estimation directly from partial observations and sample multiple plausible motions rather than relying on a single local minimum.


This work was supported by the Toyota Research Institute under the University 2.0 program, a grant from the Samsung GRO program, a grant from the Ford-Stanford Alliance, a Vannevar Bush faculty fellowship, and NSF grant IIS-1763268. Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.


  • [1] Advanced Computing Center for the Arts and Design. ACCAD MoCap Dataset.
  • [2] Ijaz Akhter and Michael J. Black. Pose-conditioned joint angle limits for 3D human pose reconstruction. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1446–1455, June 2015.
  • [3] Emre Aksan, Manuel Kaufmann, and Otmar Hilliges. Structured prediction helps 3d human motion modelling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7144–7153, 2019.
  • [4] Sadegh Aliakbarian, Fatemeh Sadat Saleh, Mathieu Salzmann, Lars Petersson, and Stephen Gould. A stochastic conditioning scheme for diverse human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5223–5232, 2020.
  • [5] Anurag Arnab, Carl Doersch, and Andrew Zisserman. Exploiting temporal context for 3d human pose estimation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3395–3404, 2019.
  • [6] Andreas Baak, Meinard Müller, Gaurav Bharaj, Hans-Peter Seidel, and Christian Theobalt. A data-driven approach for real-time full body pose reconstruction from a depth camera. In Consumer Depth Cameras for Computer Vision, pages 71–98. Springer, 2013.
  • [7] Albert E Beaton and John W Tukey. The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16(2):147–185, 1974.
  • [8] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. arXiv preprint arXiv:1506.03099, 2015.
  • [9] Benjamin Biggs, Sébastien Ehrhadt, Hanbyul Joo, Benjamin Graham, Andrea Vedaldi, and David Novotny. 3d multi-bodies: Fitting sets of plausible 3d human models to ambiguous image data. arXiv preprint arXiv:2011.00980, 2020.
  • [10] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Computer Vision – ECCV 2016, Lecture Notes in Computer Science. Springer International Publishing, Oct. 2016.
  • [11] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
  • [12] Matthew Brand and Aaron Hertzmann. Style machines. In ACM SIGGRAPH, pages 183–192, July 2000.
  • [13] Marcus A. Brubaker, David J. Fleet, and Aaron Hertzmann. Physics-based person tracking using the anthropomorphic walker. IJCV, (1), 2010.
  • [14] Marcus A. Brubaker, Leonid Sigal, and David J. Fleet. Estimating contact dynamics. In Proc. ICCV, 2009.
  • [15] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [16] Carnegie Mellon University. CMU MoCap Dataset.
  • [17] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [18] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366, 2018.
  • [19] Yixin Chen, Siyuan Huang, Tao Yuan, Siyuan Qi, Yixin Zhu, and Song-Chun Zhu. Holistic++ scene understanding: Single-view 3d holistic scene parsing and human pose estimation with human-object interaction and physical commonsense. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8648–8657, 2019.
  • [20] Vasileios Choutas, Georgios Pavlakos, Timo Bolkart, Dimitrios Tzionas, and Michael J Black. Monocular expressive body regression through body-driven attention. In European Conference on Computer Vision, pages 20–40. Springer, 2020.
  • [21] Enric Corona, Albert Pumarola, Guillem Alenya, and Francesc Moreno-Noguer. Context-aware human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020.
  • [22] Carl Doersch and Andrew Zisserman. Sim2real transfer learning for 3d human pose estimation: motion to the rescue. arXiv preprint arXiv:1907.02499, 2019.
  • [23] Ahmed Elgammal and Chan-Su Lee. Separating style and content on a nonlinear manifold. In IEEE Conf. Comp. Vis. and Pattern Recognition, pages 478–485, 2004. Vol. 1.
  • [24] Varun Ganapathi, Christian Plagemann, Daphne Koller, and Sebastian Thrun. Real time motion capture using a single time-of-flight camera. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 755–762. IEEE, 2010.
  • [25] S. Geman and D. McClure. Statistical methods for tomographic image reconstruction. Bulletin of the International Statistical Institute, 52(4):5–21, 1987.
  • [26] Georgios Georgakis, Ren Li, Srikrishna Karanam, Terrence Chen, Jana Košecká, and Ziyan Wu. Hierarchical kinematic human mesh recovery. In European Conference on Computer Vision, pages 768–784. Springer, 2020.
  • [27] Saeed Ghorbani, Kimia Mahdaviani, Anne Thaler, Konrad Kording, Douglas James Cook, Gunnar Blohm, and Nikolaus F. Troje. MoVi: A large multipurpose motion and video dataset, 2020.
  • [28] Saeed Ghorbani, Calden Wloka, Ali Etemad, Marcus A Brubaker, and Nikolaus F Troje. Probabilistic character motion synthesis using a hierarchical deep latent variable model. In Computer Graphics Forum, volume 39. Wiley Online Library, 2020.
  • [29] Riza Alp Guler and Iasonas Kokkinos. Holopose: Holistic 3d human reconstruction in-the-wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10884–10894, 2019.
  • [30] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7297–7306, 2018.
  • [31] Marc Habermann, Weipeng Xu, Michael Zollhofer, Gerard Pons-Moll, and Christian Theobalt. Deepcap: Monocular human performance capture using weak supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5052–5063, 2020.
  • [32] Ikhsanul Habibie, Daniel Holden, Jonathan Schwarz, Joe Yearsley, and Taku Komura. A recurrent variational autoencoder for human motion synthesis. In 28th British Machine Vision Conference, 2017.
  • [33] Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J. Black. Resolving 3D human pose ambiguities with 3D scene constraints. In International Conference on Computer Vision, pages 2282–2292, 2019.
  • [34] Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dimitrios Tzionas, and Michael J. Black. Populating 3d scenes by learning human-scene interaction, 2020.
  • [35] Gustav Eje Henter, Simon Alexanderson, and Jonas Beskow. Moglow: Probabilistic and controllable motion synthesis using normalising flows. ACM Transactions on Graphics (TOG), 39(6):1–14, 2020.
  • [36] I. Higgins, Loïc Matthey, A. Pal, C. Burgess, Xavier Glorot, M. Botvinick, S. Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
  • [37] Daniel Holden, Taku Komura, and Jun Saito. Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG), 36(4):1–13, 2017.
  • [38] Nicholas R. Howe, Michael E. Leventon, and William T. Freeman. Bayesian reconstruction of 3D human motion from single-camera video. In Advances in Neural Information Processing Systems 12, pages 820–826, 2000.
  • [39] Ludovic Hoyet, Kenneth Ryall, Rachel McDonnell, and Carol O’Sullivan. Sleight of hand: Perception of finger motion from reduced marker sets. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, I3D ’12, page 79–86, 2012.
  • [40] Yinghao Huang, Federica Bogo, Christoph Lassner, Angjoo Kanazawa, Peter V Gehler, Javier Romero, Ijaz Akhter, and Michael J Black. Towards accurate marker-less human shape and pose estimation over time. In 2017 international conference on 3D vision, pages 421–430. IEEE, 2017.
  • [41] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Computer Vision and Pattern Regognition, 2018.
  • [42] Angjoo Kanazawa, Jason Y Zhang, Panna Felsen, and Jitendra Malik. Learning 3d human dynamics from video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [43] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [44] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [45] Ivan Kobyzev, Simon J.D. Prince, and Marcus A. Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [46] Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. Vibe: Video inference for human body pose and shape estimation. In The IEEE Conference on Computer Vision and Pattern Recognition, June 2020.
  • [47] Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In Proceedings International Conference on Computer Vision (ICCV), pages 2252–2261. IEEE, Oct. 2019. ISSN: 2380-7504.
  • [48] Lucas Kovar, Michael Gleicher, and Frédéric Pighin. Motion graphs. In ACM Transactions on Graphics 21(3), Proc. SIGGRAPH, pages 473–482, July 2002.
  • [49] Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J Black, and Peter V Gehler. Unite the people: Closing the loop between 3d and 2d human representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6050–6059, 2017.
  • [50] Yan Li, Tianshu Wang, and Heung-Yeung Shum. Motion texture: A two-level statistical model for character motion synthesis. In ACM Transactions on Graphics 21(3), Proc. SIGGRAPH, pages 465–472, July 2002.
  • [51] Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, and Josef Sivic. Estimating 3d motion and forces of person-object interactions from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [52] Hung Yu Ling, Fabio Zinno, George Cheng, and Michiel van de Panne. Character controllers using motion vaes. In ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH), volume 39. ACM, 2020.
  • [53] Chen Liu, Kihwan Kim, Jinwei Gu, Yasutaka Furukawa, and Jan Kautz. Planercnn: 3d plane detection and reconstruction from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4450–4459, 2019.
  • [54] C. Karen Liu, Aaron Hertzmann, and Zoran Popović. Learning physics-based motion style with nonlinear inverse optimization. ACM Trans. Graph, 2005.
  • [55] Miao Liu, Dexin Yang, Yan Zhang, Zhaopeng Cui, James M Rehg, and Siyu Tang. 4d human body capture from egocentric video via 3d scene grounding. arXiv preprint arXiv:2011.13341, 2020.
  • [56] Matthew Loper, Naureen Mahmood, and Michael J. Black. MoSh: Motion and Shape Capture from Sparse Markers. ACM Trans. Graph., 33(6), Nov. 2014.
  • [57] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.
  • [58] Eyes JAPAN Co. Ltd. Eyes Japan MoCap Dataset.
  • [59] James Lucas, George Tucker, Roger B Grosse, and Mohammad Norouzi. Don't blame the ELBO! A linear VAE perspective on posterior collapse. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [60] Zhengyi Luo, S Alireza Golestaneh, and Kris M Kitani. 3d human motion estimation via motion compression and refinement. In Proceedings of the Asian Conference on Computer Vision, 2020.
  • [61] Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. In International Conference on Computer Vision, 2019.
  • [62] C. Mandery, Ö. Terlemez, M. Do, N. Vahrenkamp, and T. Asfour. The KIT whole-body human motion database. In 2015 International Conference on Advanced Robotics (ICAR), pages 329–336, July 2015.
  • [63] Dushyant Mehta, Oleksandr Sotnychenko, Franziska Mueller, Weipeng Xu, Mohamed Elgharib, Pascal Fua, Hans-Peter Seidel, Helge Rhodin, Gerard Pons-Moll, and Christian Theobalt. Xnect: Real-time multi-person 3d motion capture with a single rgb camera. ACM Transactions on Graphics (TOG), 39(4):82–1, 2020.
  • [64] Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. Vnect: Real-time 3d human pose estimation with a single rgb camera. ACM Transactions on Graphics (TOG), 36(4):1–14, 2017.
  • [65] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [66] Aron Monszpart, Paul Guerrero, Duygu Ceylan, Ersin Yumer, and Niloy J. Mitra. iMapper: Interaction-guided scene mapping from monocular videos. ACM SIGGRAPH, 2019.
  • [67] M. Müller, T. Röder, M. Clausen, B. Eberhardt, B. Krüger, and A. Weber. Documentation mocap database HDM05. Technical Report CG-2007-2, Universität Bonn, 2007.
  • [68] Dirk Ormoneit, Hedvig Sidenbladh, Michael J. Black, and Trevor Hastie. Learning and tracking cyclic human motion. In Advances in Neural Information Processing Systems 13, pages 894–900, 2001.
  • [69] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • [70] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019.
  • [71] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3d human pose and shape from a single color image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 459–468, 2018.
  • [72] Dario Pavllo, Christoph Feichtenhofer, Michael Auli, and David Grangier. Modeling human motion with quaternion-based neural networks. International Journal of Computer Vision, pages 1–18, 2019.
  • [73] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7753–7762, 2019.
  • [74] Vladimir Pavlović, James M. Rehg, and John MacCormick. Learning switching linear models of human motion. In Advances in Neural Information Processing Systems 13, pages 981–987, 2001.
  • [75] Xue Bin Peng, Angjoo Kanazawa, Jitendra Malik, Pieter Abbeel, and Sergey Levine. SFV: Reinforcement learning of physical skills from videos. ACM Trans. Graph, 2018.
  • [76] Davis Rempe, Leonidas J. Guibas, Aaron Hertzmann, Bryan Russell, Ruben Villegas, and Jimei Yang. Contact and human dynamics from monocular video. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • [77] Chris Rockwell and David F Fouhey. Full-body awareness from partial observations. arXiv preprint arXiv:2008.06046, 2020.
  • [78] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6), Nov. 2017.
  • [79] Charles Rose, Michael F. Cohen, and Bobby Bodenheimer. Verbs and adverbs: Multidimensional motion interpolation. IEEE Computer Graphics and Applications, 18(5):32–40, 1998.
  • [80] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • [81] Mingyi Shi, Kfir Aberman, Andreas Aristidou, Taku Komura, Dani Lischinski, Daniel Cohen-Or, and Baoquan Chen. Motionet: 3d human motion reconstruction from monocular video with skeleton consistency. ACM Transactions on Graphics (TOG), 40(1):1–15, 2020.
  • [82] Soshi Shimada, Vladislav Golyanik, Weipeng Xu, and Christian Theobalt. Physcap: Physically plausible monocular 3d motion capture in real time. ACM Trans. Graph., 39(6), Nov. 2020.
  • [83] Hedvig Sidenbladh, Michael J. Black, and David J. Fleet. Stochastic tracking of 3D human figures using 2D image motion. In ECCV, pages 702–718, 2000. Part II.
  • [84] Leonid Sigal, Alexandru O Balan, and Michael J Black. Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International journal of computer vision, 87(1-2):4, 2010.
  • [85] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28, pages 3483–3491. Curran Associates, Inc., 2015.
  • [86] Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. Neural state machine for character-scene interactions. ACM Trans. Graph., 38(6):209–1, 2019.
  • [87] Yu Sun, Yun Ye, Wu Liu, Wenpeng Gao, Yili Fu, and Tao Mei. Human mesh recovery from monocular images via a skeleton-disentangled representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5349–5358, 2019.
  • [88] Graham W. Taylor, Geoffrey E. Hinton, and Sam T. Roweis. Modeling human motion using binary latent variables. In Proc. NIPS, 2007.
  • [89] Nikolaus F. Troje. Decomposing biological motion: A framework for analysis and synthesis of human gait patterns. Journal of Vision, 2(5):2–2, Sept. 2002.
  • [90] Matthew Trumble, Andrew Gilbert, Charles Malleson, Adrian Hilton, and John Collomosse. Total capture: 3d human pose estimation fusing video and inertial sensors. In Proceedings of the British Machine Vision Conference (BMVC), Sept. 2017.
  • [91] Simon Fraser University and National University of Singapore. SFU Motion Capture Database.
  • [92] Raquel Urtasun, David J. Fleet, and Pascal Fua. 3D people tracking with Gaussian process dynamical models. In IEEE Conf. Comp. Vis. & Pattern Rec., pages 238–245, 2006. Vol. 1.
  • [93] Raquel Urtasun, David J. Fleet, and Pascal Fua. Temporal motion models for monocular and multiview 3D human body tracking. CVIU, 104(2):157–177, 2006.
  • [94] Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. BodyNet: Volumetric inference of 3D human body shapes. In ECCV, 2018.
  • [95] Gül Varol, Ivan Laptev, Cordelia Schmid, and Andrew Zisserman. Synthetic humans for action recognition from unseen viewpoints. arXiv preprint arXiv:1912.04070, 2019.
  • [96] Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 109–117, 2017.
  • [97] Jack M. Wang. Gaussian process dynamical models for human motion. Master’s thesis, University of Toronto, 2005.
  • [98] Jianqiao Wangni, Dahua Lin, Ji Liu, Kostas Daniilidis, and Jianbo Shi. Towards statistically provable geometric 3d human pose recovery. SIAM Journal on Imaging Sciences, 14(1):246–270, 2021.
  • [99] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
  • [100] Donglai Xiang, Hanbyul Joo, and Yaser Sheikh. Monocular total capture: Posing face, body, and hands in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10974, 2019.
  • [101] Dongseok Yang, Doyeon Kim, and Sung-Hee Lee. Real-time lower-body pose prediction from sparse upper-body tracking signals. arXiv preprint arXiv:2103.01500, 2021.
  • [102] Ye Yuan and Kris Kitani. Dlow: Diversifying latent flows for diverse human motion prediction. In European Conference on Computer Vision, pages 346–364. Springer, 2020.
  • [103] Andrei Zanfir, Eduard Gabriel Bazavan, Hongyi Xu, William T Freeman, Rahul Sukthankar, and Cristian Sminchisescu. Weakly supervised 3d human pose and shape reconstruction with normalizing flows. In European Conference on Computer Vision, pages 465–481. Springer, 2020.
  • [104] Andrei Zanfir, Elisabeta Marinoiu, and Cristian Sminchisescu. Monocular 3d pose and shape estimation of multiple people in natural scenes-the importance of multiple scene constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2148–2157, 2018.
  • [105] Jason Y Zhang, Panna Felsen, Angjoo Kanazawa, and Jitendra Malik. Predicting 3d human dynamics from video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7114–7123, 2019.
  • [106] Tianshu Zhang, Buzhen Huang, and Yangang Wang. Object-occluded human shape and pose estimation from a single color image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
  • [107] Yan Zhang, Michael J. Black, and Siyu Tang. We are more than our joints: Predicting how 3d bodies move, 2020.


Here we provide details and extended evaluations omitted from the main paper for brevity. (In the rest of this document, we refer to the main paper simply as "the paper.") App. A provides extended discussions, App. B and App. C give method details regarding the HuMoR model and test-time optimization (TestOpt), App. D derives our optimization energy from a probabilistic perspective, App. E provides experimental details from the main paper, and App. F contains extended experimental evaluations.

We encourage the reader to view the supplementary videos on the project webpage for extensive qualitative results. We further discuss these results in App. F.

Appendix A Discussions

State Representation

Our state representation is somewhat redundant in that it includes both explicit joint positions and SMPL parameters (which also determine joint positions through the body model). This is motivated by recent works [52, 107] which show that using an extrinsic representation of body keypoints (joint positions or mesh vertices) helps in learning motion characteristics like static contact, thereby improving the visual quality of generated motions. The over-parameterization, unique to our approach, additionally allows for consistency losses leveraged during CVAE training and in TestOpt.
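A consistency loss enabled by this over-parameterization can be as simple as penalizing disagreement between the explicitly predicted joints and the joints produced by the body model from the predicted SMPL parameters. The sketch below is our own illustration (the function name and mean-squared reduction are assumptions, not the paper's exact loss):

```python
import torch

def joint_consistency_loss(pred_joints, smpl_joints):
    """Penalize disagreement between explicitly predicted joints and
    joints regressed from the predicted SMPL parameters.

    pred_joints, smpl_joints : (B, J, 3) tensors.
    """
    # Per-joint squared Euclidean distance, averaged over batch and joints.
    return ((pred_joints - smpl_joints) ** 2).sum(dim=-1).mean()
```

In practice `smpl_joints` would come from a forward pass of the body model; here both inputs are treated as given tensors.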

Another noteworthy property of our state is that it does not explicitly represent full-body shape — only bone proportions are implicitly encoded through joint locations. During training, we use the shape parameters provided in AMASS [61] to compute joint positions, but otherwise the CVAE is shape-unaware. Extending our formulation to include full-body shape is an important direction for improved generalization and should be considered in future work.

Conditioning on More Time Steps

Alternatively, we could condition the dynamics learned by the CVAE on additional previous steps; however, since the state already includes velocities, this is unnecessary and only increases the chance of overfitting to training motions. It would also increase the computation needed for both generation and TestOpt.

Why CVAE? Our use of a CVAE to model motion is primarily motivated by recent promising results in the graphics community [52, 28]. Not only is it a simple solution, but it also affords the physical interpretation presented in the main paper. Other deep generative models could be considered, but each has potential issues compared to our CVAE. A conditional generative adversarial network [65] would use standard normal noise for the latent transition, which we show is insufficient in multiple experiments; furthermore, it does not allow for inferring a latent transition from a pair of states. Past works have had success with recurrent and variational-recurrent architectures [107]. As discussed previously, the reliance of these networks on multiple timesteps increases overfitting, which is especially dangerous for our estimation application, which must be able to represent arbitrary observed motions. Finally, normalizing flows [45] and neural ODEs [18] show exciting potential for modeling human motion, but conditional generation with these models is not yet well-developed.

A Note on β-VAE [36]

The KL weight β in Eq. 7 of the main paper is not directly comparable to that of a typical β-VAE [36] due to various implementation details. First, the reconstruction term is the mean-squared error (MSE) of the unnormalized state rather than the true log-likelihood. The use of additional regularizers that are not probabilistically formulated as part of the reconstruction loss further compounds the difference. Moreover, in practice, losses are averaged over both the feature and batch dimensions so as not to depend on the chosen dimensionalities. All of these differences factor into our chosen setting of β.
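The effect of mean-reduction on the KL weight can be made concrete. This sketch is our own illustration, not the paper's training code; it shows how averaging over feature dimensions rescales the reconstruction/KL trade-off relative to a sum-reduction β-VAE:

```python
import torch

def elbo_loss(x, x_hat, mu, logvar, beta):
    """Mean-reduced VAE objective: both the MSE reconstruction term and the
    KL term are averaged over batch AND feature dimensions, so the effective
    trade-off depends on the state and latent sizes."""
    rec = ((x - x_hat) ** 2).mean()
    kl = (-0.5 * (1.0 + logvar - mu ** 2 - logvar.exp())).mean()
    return rec + beta * kl

def effective_beta(beta, state_dim, latent_dim):
    """Relative to a sum-reduction (per-sample) formulation, mean-reduction
    divides the reconstruction term by state_dim and the KL term by
    latent_dim, scaling the effective KL weight by state_dim / latent_dim."""
    return beta * state_dim / latent_dim
```

This is one reason a β value chosen here cannot be read as a standard β-VAE setting: the same numerical β implies a different effective weight whenever the state and latent dimensionalities differ.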

The Need for Regularization in Optimization

The motion prior term, which leverages our learned conditional prior and initial-state GMM, falls naturally out of the MAP derivation (see App. D below) and is by itself a reasonable way to ensure that motion is plausible. However, in practice it can be prone to local minima and slow to converge without any regularization. This is primarily because HuMoR is trained on clean motion capture data from AMASS [61], while in the early stages of optimization the initial state will be far from this domain. This means that motions rolled out using the CVAE decoder will be implausible, and the likelihood output by the learned conditional prior is not necessarily meaningful (since inputs will be well outside the training distribution). The additional regularizers presented in the main paper allow us to resolve this issue by reflecting the expected behavior of the motion model when it produces truly plausible motions (i.e., motions similar to the training data).

On Evaluation Metrics

As discussed in prior work [76], traditional positional metrics used to evaluate root-relative pose estimates capture neither the accuracy of the absolute ("global") motion nor its physical and perceptual plausibility. This is why we use a range of metrics covering global joint accuracy, local joint accuracy (after aligning root joints), and motion plausibility. However, these metrics still have flaws, and there is a need to develop more informative motion estimation metrics for both absolute accuracy and plausibility. This is especially true under severe occlusions, where there is no single correct answer: even if "ground truth" 3D joints are available, multiple motions may explain the partial observations equally well.
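The distinction between global and root-aligned positional error can be sketched as follows (an illustrative helper, not the paper's evaluation code):

```python
import numpy as np

def mpjpe(pred, gt, align_root=False, root_idx=0):
    """Mean per-joint position error, either on absolute ("global") joints
    or after translating both skeletons so their root joints coincide.

    pred, gt : (T, J, 3) arrays of joint positions.
    """
    if align_root:
        # Subtract each frame's root joint so only the local pose is compared.
        pred = pred - pred[:, root_idx:root_idx + 1]
        gt = gt - gt[:, root_idx:root_idx + 1]
    return np.linalg.norm(pred - gt, axis=-1).mean()
```

A prediction with a perfect local pose but a constant global offset scores zero root-aligned error while the global error equals the offset magnitude, which is exactly the gap the discussion above warns about.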

On Convergence

Our multi-objective optimization uses a mixture of convex and non-convex loss functions. As we utilize L-BFGS, the minimum-energy solution we report is only locally optimal. While simulated annealing or MCMC/HMC (Markov chain Monte Carlo / Hamiltonian Monte Carlo) exploration approaches could be deployed to search for the global optimum, such methods incur a heavy computational load and are hence prohibitive in our setting. Thanks to accurate initialization, we found that TestOpt converges to a good minimum most of the time. This observation is also lightly supported by recent work arguing that statistically provable convergence can be attained for the human pose problem under convex and non-convex regularization using a multi-stage optimization scheme [98].
a.1 Assumptions and Limitations

On the Assumption of a Ground Plane

We use the ground during TestOpt to obtain a transformation to the canonical reference frame in which our prior is trained. While this is a reasonable assumption in a majority of scenarios, we acknowledge that certain applications might require in-the-wild operation where a single ground plane does not exist, e.g., climbing stairs or moving over complex terrain. In such scenarios, we still require a consistent reference frame, which can be computed from: (i) an accelerometer, if a mobile device is used; (ii) the pose of static, rigid objects, if an object detector is deployed; or (iii) fiducial tags or any other means of obtaining a gravity direction.
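A measured gravity direction, as in option (iii), can be turned into a canonicalizing rotation via Rodrigues' formula. The sketch below is our own illustration and assumes a canonical frame whose down direction is −z:

```python
import numpy as np

def gravity_alignment(g_measured, g_canonical=(0.0, 0.0, -1.0)):
    """Rotation matrix taking a measured gravity direction (e.g. from an
    accelerometer) to the canonical down direction, via Rodrigues' formula."""
    a = np.asarray(g_measured, dtype=float)
    a = a / np.linalg.norm(a)
    b = np.asarray(g_canonical, dtype=float)
    b = b / np.linalg.norm(b)
    v = np.cross(a, b)        # rotation axis (unnormalized)
    c = float(np.dot(a, b))   # cosine of the rotation angle
    s2 = float(np.dot(v, v))  # sine squared of the rotation angle
    if s2 < 1e-16:
        if c > 0:
            return np.eye(3)  # already aligned
        # Opposite directions: 180-degree rotation about a perpendicular axis.
        axis = np.cross(a, [1.0, 0.0, 0.0])
        if np.dot(axis, axis) < 1e-16:
            axis = np.cross(a, [0.0, 1.0, 0.0])
        axis = axis / np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    # Skew-symmetric cross-product matrix of v.
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + K + (K @ K) * ((1.0 - c) / s2)
```

Applying the resulting rotation to the observed scene yields a gravity-consistent frame even when no ground plane is visible.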

Note that the ground plane is not an essential piece of the test-time optimization. It is a requirement only because of the way our CVAE is trained: on motions with a ground plane at z = 0, gravity in the −z direction, and without complex terrain interactions. Although we empirically noticed that convergence of training necessitates this assumption, other architectures or the availability of larger in-the-wild motion datasets might make training HuMoR possible under arbitrary poses. This perspective should clarify why our method can work even when the ground is invisible: TestOpt can converge from a bad initialization as long as our prior (HuMoR) is able to account for the observation.

On the Assumption of a Static Camera

While a static camera is assumed in all of our evaluations, recent advances in 3D computer vision make it possible to overcome this limitation. Our method, backed by either a structure from motion / SLAM pipeline or a camera relocalization engine, can indeed work in scenarios where the camera moves as well as the human targets. A more sophisticated solution could leverage our learned motion model to disambiguate between camera and human motion. Expectedly, this requires further investigation, making room for future studies as discussed at the end of the main paper.

Figure 6: Failure cases of TestOpt using HuMoR. Please see Sec. A.1 or the supplementary videos for details of each.

Other Limitations and Failure Cases

As discussed in Sec. 6 of the main paper, HuMoR has limitations that motivate multiple future directions. First, optimization is generally slow compared to learning-based (direct prediction) methods. This also applies to our test-time optimization. Approaches for learning to optimize could come in handy for increasing the efficiency of our method. Additionally, our current formulation of TestOpt allows only for a single output, the local optimum. Therefore, future work may explore learned approaches yielding multi-hypothesis output, which can be used to characterize uncertainty.

Specific failure cases (as shown in the supplementary videos and Fig. 6) further highlight areas of future improvement. First, extreme occlusions (e.g., only a few visible points as in Fig. 6, left), especially at the first frame, which determines the initial state, make for a difficult optimization that often lands in local minima with implausible motions. Second, uncommon motions that are rare during CVAE training, such as lying down in Fig. 6 (middle), can cause spurious ground plane outputs as TestOpt attempts to make the motion more likely. Leveraging more holistic scene understanding methods and models of human-environment interaction will help in these cases. Finally, our method depends on motion to resolve ambiguity, which is usually very helpful but has corner cases as shown in Fig. 6 (right). For example, if the observed person is nearly static, the optimization may produce implausible poses due to ambiguous occlusions (e.g., standing when really the person is sitting) and/or incorrect ground plane estimations.

Appendix B HuMoR Model Details

In this section, we provide additional implementation details for the HuMoR motion model described in Sec. 3 of the main paper.

b.1 CVAE Architecture and Implementation

Body Model

We use the SMPL+H body model [78] since it is used by the AMASS [61] dataset. However, our focus is on modeling body motion, so HuMoR and TestOpt do not consider the hand joints (leaving the 22 body joints, including the root). Hand joints could be straightforwardly optimized along with body motion, but this was outside our current scope.

Canonical Coordinate Frame

To ease learning and improve generalization, our network operates on inputs in a canonical coordinate frame. Specifically, based on the initial root translation and orientation, we apply a rotation around the up (z) axis and a translation in (x, y) such that the x and y components of the root translation are 0 and the person is facing the +y direction.
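A minimal sketch of this canonicalization, assuming a z-up frame and a heading angle already extracted from the root orientation (the function name and the heading-angle input are illustrative, not HuMoR's API):

```python
import numpy as np

def canonicalize(root_pos, heading_angle):
    """Build the rigid transform into the canonical frame: the x and y
    components of the root go to 0 and the person faces +y.

    heading_angle: the facing direction as a rotation about the up (z)
    axis, in radians, measured from the +y axis (an assumed convention).
    """
    # Rotate about z by -heading_angle so the heading maps onto +y.
    c, s = np.cos(-heading_angle), np.sin(-heading_angle)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    # Translate so the rotated root has zero x and y (height is kept).
    t = -R @ root_pos
    t[2] = 0.0
    return R, t

root = np.array([1.0, 2.0, 0.9])
R, t = canonicalize(root, np.pi / 4)
p = R @ root + t  # transformed root: x = y = 0, height unchanged
```

Applying the inverse transform maps CVAE outputs back to the original frame.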


The encoder and prior networks are identical multi-layer perceptrons (MLPs) with 5 layers and hidden size 1024. The decoder is a 4-layer MLP with hidden sizes (1024, 1024, 512). The latent transition z_t is skip-connected to every layer of the decoder in order to emphasize its importance and help avoid posterior collapse [52]. ReLU non-linearities and group normalization [99] with 16 groups are used between all layers except the outputs in each network. Input rotations are represented as 3x3 matrices, while the network outputs the axis-angle representation in R^3. In total, the CVAE network contains 9.7 million parameters.
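A hypothetical PyTorch sketch of such a skip-connected decoder; the dimensions (latent, conditioning, and output sizes) are placeholders, and the exact layer layout is our reading of the description rather than the released HuMoR code:

```python
import torch
import torch.nn as nn

class SkipDecoder(nn.Module):
    """Decoder MLP with hidden sizes (1024, 1024, 512) whose latent
    input is re-appended (skip-connected) at every layer."""
    def __init__(self, latent_dim=48, cond_dim=339, out_dim=339,
                 hidden=(1024, 1024, 512)):
        super().__init__()
        dims = [latent_dim + cond_dim] + list(hidden)
        layers = []
        for i in range(len(hidden)):
            # After the first layer, concatenate the latent z back onto
            # the features: the skip connection described in the text.
            in_dim = dims[i] if i == 0 else dims[i] + latent_dim
            layers.append(nn.Sequential(
                nn.Linear(in_dim, dims[i + 1]),
                nn.GroupNorm(16, dims[i + 1]),  # 16 groups, as in the text
                nn.ReLU(),
            ))
        self.layers = nn.ModuleList(layers)
        self.out = nn.Linear(hidden[-1] + latent_dim, out_dim)

    def forward(self, z, cond):
        h = torch.cat([z, cond], dim=-1)
        for i, layer in enumerate(self.layers):
            if i > 0:
                h = torch.cat([h, z], dim=-1)
            h = layer(h)
        return self.out(torch.cat([h, z], dim=-1))

dec = SkipDecoder()
y = dec(torch.zeros(2, 48), torch.zeros(2, 339))
```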

b.2 CVAE Training


The loss function used for training is primarily described in the main paper (see Eq. 7). For a training pair (x_{t-1}, x_t), the KL divergence loss term is computed between the output distributions of the encoder and conditional prior as

\mathrm{KL}\left( q_\phi(z_t \mid x_t, x_{t-1}) \,\|\, p_\theta(z_t \mid x_{t-1}) \right).
The SMPL loss is computed using the ground truth shape parameters provided in AMASS on the ground truth gendered body model.
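For diagonal Gaussian encoder and prior outputs, the KL term above has a well-known closed form. A small sketch, assuming each network outputs a mean and a log-variance vector:

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, var_q) || N(mu_p, var_p) ) between
    diagonal Gaussians, summed over latent dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    per_dim = 0.5 * (logvar_p - logvar_q
                     + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return per_dim.sum()

# Identical encoder and prior distributions give zero divergence.
zero = np.zeros(48)
kl_same = kl_diag_gaussians(zero, zero, zero, zero)
```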


For training, we use AMASS [61]: a large, publicly-available motion capture (mocap) database containing over 11k motion sequences from 344 different people fit to SMPL. The database aggregates and standardizes many mocap datasets into one. We pre-process AMASS by cropping the middle 80% of each motion sequence, sub-sampling to 30 Hz, estimating velocities with finite differences, and using automated heuristics based on foot contacts to remove sequences with substantial terrain interaction (stairs, ramps, or platforms). We automatically annotate ground contacts for 8 body joints (left and right toes, heels, knees, and hands) based on velocity and height. In particular, if a joint has moved less than a small distance threshold over the last timestep and its z component is within a height threshold of the floor, it is considered to be in contact. For toe joints, we use a tighter height threshold.
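The contact heuristic can be sketched as below; the numeric thresholds here are illustrative placeholders, not the exact values used in the paper:

```python
import numpy as np

# Illustrative thresholds -- stand-ins for the paper's exact values.
VEL_THRESH = 0.005        # max displacement (m) over one timestep
HEIGHT_THRESH = 0.08      # max height (m) above the floor
TOE_HEIGHT_THRESH = 0.04  # tighter threshold for toe joints

def annotate_contacts(joints_prev, joints_curr, floor_z, is_toe):
    """Label a joint as in ground contact if it barely moved over the
    last timestep and sits close to the floor plane (z-up frame)."""
    disp = np.linalg.norm(joints_curr - joints_prev, axis=-1)
    height = joints_curr[..., 2] - floor_z
    h_thresh = np.where(is_toe, TOE_HEIGHT_THRESH, HEIGHT_THRESH)
    return (disp < VEL_THRESH) & (height < h_thresh)

prev = np.array([[0.0, 0.0, 0.01], [0.0, 0.0, 0.5]])
curr = np.array([[0.0, 0.0, 0.01], [0.1, 0.0, 0.5]])
contacts = annotate_contacts(prev, curr, 0.0, np.array([True, False]))
# the static toe near the floor is in contact; the moving joint is not
```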

For training the CVAE, we use the recommended training split (save for TCD Hands [39] which contains mostly hand motions): CMU [16], MPI Limits [2], TotalCapture [90], Eyes Japan [58], KIT [62], BMLrub [89], BMLmovi [27], EKUT [62], and ACCAD [1]. For validation during training we use MPI HDM05 [67], SFU [91], and MPI MoSh [56]. Finally for evaluations (Sec. 5.3 of the main paper), we use HumanEva [84] and Transitions [61].

Training Procedure

We train using 10-step sequences sampled on-the-fly from the training set (in order to use scheduled sampling as detailed below). To acquire a training sequence, a full mocap sequence is randomly (uniformly) chosen from AMASS and then a random 10-step window within that sequence is (uniformly) sampled. Training is performed using batches of 2000 sequences for 200 epochs with Adamax [43]; we found this to be more stable than using Adam. The learning rate is decayed three times, at epochs 50, 80, and 140. We linearly anneal the KL loss weight from 0 to its full value over the first 50 epochs to avoid posterior collapse. We use early stopping by choosing the network parameters that result in the best validation-split performance throughout training.

Training Computational Requirements

We train our CVAE on a single Tesla V100 16GB GPU, which takes approximately 4 days.

Scheduled Sampling

As explained in the main paper, our scheduled sampling follows [52]. In particular, at each training epoch we define a probability p of using the ground truth state as input at each timestep in a training sequence, as opposed to the model's own previous output. Training is done using a curriculum that includes p = 1 (regular supervised training), 0 < p < 1 (a mix of true and self inputs at each step), and finally p = 0 (always using full generated rollouts). Importantly for training stability, when using the model's own prediction as the input at a step, we do not backpropagate gradients from the loss at that step back through the previous prediction.

For CVAE training, we use 10 epochs of regular supervised training, 10 of mixed true and self inputs, and the rest using full self-rollouts.
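This curriculum can be sketched as follows; the linear decay within the mixed phase is our assumption, since the text only states that true and self inputs are mixed:

```python
import random

def scheduled_sampling_prob(epoch, sup_epochs=10, mix_epochs=10):
    """Probability p of feeding the ground-truth previous state:
    p = 1 for the supervised phase, a (assumed linear) decay during the
    mixed phase, then p = 0 for full self-rollouts."""
    if epoch < sup_epochs:
        return 1.0
    if epoch < sup_epochs + mix_epochs:
        return 1.0 - (epoch - sup_epochs + 1) / (mix_epochs + 1)
    return 0.0

def choose_input(p, gt_state, pred_state):
    # With probability p use the ground truth; otherwise the model's
    # own (detached) previous output, so no gradients flow through it.
    return gt_state if random.random() < p else pred_state

p_mid = scheduled_sampling_prob(15)  # somewhere in the mixed phase
```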

b.3 Initial State GMM

State Representation

Since the GMM models a single state, we use a modified representation that is minimal (avoids redundancies) in order to be useful during test-time optimization. In particular, the GMM state contains the root linear and angular velocities along with the joint positions and velocities. During TestOpt, joints are determined from the current SMPL parameters so that gradients of the GMM log-likelihood (Eq. 11 in the main paper) will be used to update the initial-state SMPL parameters.

Implementation Details

The GMM uses full covariance matrices for each of the 12 components and operates in the same canonical coordinate frame as the CVAE. It is trained using expectation maximization (implemented in scikit-learn) on every state in the same AMASS training set used for the CVAE.
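The EM fitting can be reproduced with scikit-learn's GaussianMixture; the sketch below uses random placeholder features in place of the real canonicalized initial states (the feature dimension here is arbitrary):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for the canonicalized initial states from the AMASS training
# split (random data; the real features are the minimal GMM state).
rng = np.random.default_rng(0)
states = rng.normal(size=(5000, 20))

gmm = GaussianMixture(n_components=12, covariance_type="full",
                      max_iter=100, random_state=0)
gmm.fit(states)  # fit by expectation maximization, as in the paper

# At test time, score_samples gives per-state log-likelihoods of the
# kind used in the initial-state term of the optimization objective.
log_lik = gmm.score_samples(states[:3])
```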

Appendix C Test-Time Optimization Details

In this section, we give additional details of the motion and shape optimization detailed in Sec. 4 of the main paper.

State Representation

In practice, for optimization we slightly modify the state from Eq. 1 in the main paper. First, we remove the joint positions to avoid the previously discussed redundancy, which is good for training the CVAE but bad for test-time optimization. Instead, we compute joints from the SMPL model whenever they are needed. Second, we represent body pose in the latent space of the VPoser [70] pose prior. Whenever needed, we can map between full joint angles and the latent pose using the VPoser encoder and decoder. Finally, state variables are represented in the coordinate frame of the given observations, relative to the camera, to allow fitting to data.

Floor Parameterization

As detailed in the main paper, to obtain the transformation between the canonical coordinate frame in which our CVAE is trained and the observation frame used for optimization, we additionally optimize the floor plane of the scene. The plane is parameterized by its unit normal vector and scalar offset. To disambiguate the direction of the normal, we assume that its y-component in the camera coordinate frame must be negative, i.e., it points upward. This assumes the camera is not so severely tilted that the observed scene is "upside down".

Observation-to-Canonical Transformation

We assume that gravity is orthogonal to the ground plane. Therefore, given the current floor and root state, we compute a rotation and translation to the canonical CVAE frame: after the transformation, the ground normal is aligned with the +z axis, the person faces the +y direction, and the x and y components of the root translation are 0. With this ability, we can always compute the state at time t from the initial state, the latent sequence, and the floor by (i) transforming to the canonical frame, (ii) using the CVAE to roll out the motion, and (iii) transforming back to the observation frame.

Optimization Objective Details

The optimization objective is detailed in Sec. 4.2 of the main paper. To compensate for the heavy-tailed behavior of real data, we use robust losses for multiple data terms. One data term uses the Geman-McClure function [25], which for our purposes is defined as rho(r) = r^2 / (sigma^2 + r^2) for a residual r and scaling factor sigma. The point-cloud data term uses robust bisquare weights [7]. These weights are computed based on the one-way chamfer distance term (see Eq. 14 in the main paper): residuals over the whole sequence are first normalized using a robust estimate of the standard deviation based on the median absolute deviation (MAD), then each weight is computed as

w(u) = (1 - (u / k)^2)^2 if |u| <= k, and w(u) = 0 otherwise.

In this equation, u is a normalized residual and k is a tuning constant which we set to 4.6851.
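Both robust weightings are easy to sketch. The Geman-McClure form below and the MAD consistency constant 1.4826 are standard choices we assume, so they should be checked against the paper's exact definitions:

```python
import numpy as np

def geman_mcclure(r, sigma):
    """Geman-McClure robust penalty for residual r (assumed form:
    r^2 / (sigma^2 + r^2); saturates toward 1 for large residuals)."""
    return r ** 2 / (sigma ** 2 + r ** 2)

def bisquare_weights(residuals, k=4.6851):
    """Bisquare weights on MAD-normalized residuals: weights fall off
    smoothly and hit exactly 0 beyond the tuning constant k."""
    # Robust scale estimate via the median absolute deviation (the
    # 1.4826 factor makes MAD consistent with the Gaussian std).
    mad = np.median(np.abs(residuals - np.median(residuals)))
    u = residuals / (1.4826 * mad + 1e-12)
    w = (1.0 - (u / k) ** 2) ** 2
    w[np.abs(u) > k] = 0.0
    return w

w = bisquare_weights(np.array([0.0, 1.0, 100.0]))
# the huge outlier residual receives weight exactly 0
```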


As detailed in Sec. 4.2 of the main paper, our optimization is initialized by directly optimizing SMPL pose and shape parameters using the data terms along with a pose prior and joint smoothing term, weighted as given in Tab. 5. This two-stage initialization first optimizes global translation and orientation for 30 optimization steps, followed by full pose and shape for 80 steps. At termination, we estimate velocities using finite differences, which allows direct initialization of the initial state. To obtain the initial latent sequence, the CVAE encoder is used to infer the latent transition between every pair of frames. The initial shape parameters are a direct output of the initialization optimization. Finally, for fitting to RGB(-D), the ground plane is initialized from video with PlaneRCNN [53].

Optimization (TestOpt) Details

Our optimization is implemented in PyTorch [69] using L-BFGS with a step size of 1.0 and autograd. For all experiments, we optimize using the neutral SMPL+H [78] body model in 3 stages. First, only the initial state and the first 15 frames of the latent sequence are optimized for 30 iterations in order to quickly reach a reasonable initial state. Next, the initial state is fixed while the full latent dynamics sequence is optimized for 25 iterations; finally, the full sequence and initial state are tuned together for another 15 iterations. The ground and shape are optimized in every stage.
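A minimal PyTorch sketch of this staged schedule; a toy quadratic energy stands in for the full objective, and all variable shapes are illustrative:

```python
import torch

# Toy stand-ins for the optimization variables (shapes hypothetical).
x0 = torch.zeros(5, requires_grad=True)      # initial state
z = torch.zeros(15, 3, requires_grad=True)   # latent sequence

def energy():
    # Toy quadratic energy in place of the real data + prior terms.
    return ((x0 - 1.0) ** 2).sum() + ((z - 0.5) ** 2).sum()

def run_stage(params, iters):
    """One stage: L-BFGS with step size 1.0 over the listed variables,
    mirroring the three-stage schedule described above."""
    opt = torch.optim.LBFGS(params, lr=1.0, max_iter=iters)
    def closure():
        opt.zero_grad()
        loss = energy()
        loss.backward()
        return loss
    opt.step(closure)

run_stage([x0, z], iters=30)  # stage 1: initial state + early latents
run_stage([z], iters=25)      # stage 2: initial state fixed
run_stage([x0, z], iters=15)  # stage 3: everything tuned together
```

Note that PyTorch's L-BFGS requires a closure that re-evaluates the loss, since it performs multiple function evaluations per step.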

The energy weights used for each experiment in the main paper are detailed in Tab. 5. The left part of the table indicates weights for the initialization phase (the VPoser-t baseline), while the right part is our full proposed optimization. A dash indicates the energy is not relevant for that data modality and therefore not used. Weights were manually tuned using the presented evaluation metrics and qualitative assessment. Note that for similar modalities (3D joints and keypoints, or RGB and RGB-D) weights are quite similar and so only slight tuning should be necessary to transfer to new data. The main tradeoff comes between reconstruction accuracy and motion plausibility: the motion prior is weighted higher for i3DB, which contains many severe occlusions, than for PROX RGB where the person is often nearly fully visible.

Initialization Full Optimization
AMASS (occ keypoints) 1.0 - - 0.015 0.1 1.0 - - 0.015 1.0 10.0 1.0 1.0 -
AMASS (noisy joints) 1.0 - - 0.015 10.0 1.0 - - 0.015 1.0 10.0 1.0 1.0 -
i3DB (RGB) - - 4.5 0.04 100.0 - - 4.5 0.075 0.075 100.0 0.0 10.0 15.0
PROX (RGB) - - 4.5 0.04 100.0 - - 4.5 0.05 0.05 100.0 100.0 10.0 15.0
PROX (RGB-D) - 1.0 3.0 0.1 100.0 - 1.0 3.0 0.075 0.075 100.0 100.0 10.0 90.0
Table 5: Energy weightings used in test-time optimization for each experiment in Sec. 5 of the main paper.

Appendix D MAP Objective Derivation

In this section, we formulate the core of the pose and shape optimization objective (Eq. 10 in the main paper) from a probabilistic perspective. Recall, we want to optimize the initial state x_0, a sequence of latent variables z_{1:T}, ground g, and shape \beta based on a sequence of observations y_{0:T}. We are interested in the maximum a-posteriori (MAP) estimate:

\max \; p(x_0, z_{1:T}, g, \beta \mid y_{0:T}) \propto p(y_{0:T} \mid x_0, z_{1:T}, g, \beta) \, p(x_0, z_{1:T}, g, \beta). \quad (19)

Assuming each observation y_t is conditionally independent of the others given the corresponding state x_t, the left term is written

p(y_{0:T} \mid x_0, z_{1:T}, g, \beta) = \prod_{t=0}^{T} p(y_t \mid x_t, g, \beta),

where x_t is assumed to depend only on the initial state and past transitions. Additionally, x_t is replaced with the CVAE rollout from x_0 and z_{1:t} as detailed previously. The right term in Eq. 19 is written as

p(x_0, z_{1:T}, g, \beta) = \left[ \prod_{t=1}^{T} p_\theta(z_t \mid x_{t-1}) \right] p(x_0) \, p(g) \, p(\beta),

where x_0, g, and \beta are assumed to be mutually independent. We then use these results within Eq. 19 to optimize the negative log-likelihood:

\min_{x_0, z_{1:T}, g, \beta} \; -\sum_{t=0}^{T} \log p(y_t \mid x_t, g, \beta) \; - \sum_{t=1}^{T} \log p_\theta(z_t \mid x_{t-1}) \; - \log p(x_0) - \log p(g) - \log p(\beta).
Assuming each energy presented in the main paper can be written as the log-likelihood of a distribution, this formulation recovers our optimization objective, apart from the additional regularizers (these terms could, in principle, be folded into a more complex motion prior term; however, for simplicity we do not do this). Next, we connect each energy term as presented in Sec. 4.2 of the paper to the probabilistic perspective.

Motion Prior

This term is already the log-likelihood of our HuMoR motion model (Eq. 11 of the paper), which exactly aligns with the MAP derivation.

Data Term

The form of the data term is modality-dependent. In the simplest case, the observations are 3D joints (or keypoints with known correspondences) and the observation likelihood is a Gaussian centered on the corresponding model joints, so the energy reduces to the squared distances written in Eq. 12 of the paper. For other modalities (Eq. 13 and 14 in the paper), the data term can be seen as resulting from a more sophisticated noise model.

Ground Prior

We assume the ground should stay close to its initialization, so p(g) is a Gaussian centered at the initial ground estimate, corresponding to the ground objective in the paper.

Shape Prior

The shape should stay near the neutral zero vector, so p(\beta) is a zero-mean Gaussian, which gives an L2 regularization energy on \beta.

Appendix E Experimental Evaluation Details

In this section, we provide details of the experimental evaluations in Sec. 5 of the main paper.

e.1 Datasets

AMASS [61] We use the same processed AMASS dataset as described in Sec. B.2 for experiments. Experiments in Sec. 5.3 and 5.4 of the main paper use the held out Transitions and HumanEva [84] subsets which together contain 4 subjects and about 19 minutes of motion.

i3DB [66] is a dataset of RGB videos captured at 30 Hz containing numerous person-environment interactions involving medium to heavy occlusions. It contains annotated 3D joint positions at 10 Hz along with a primitive cuboid 3D scene reconstruction. We run off-the-shelf 2D pose estimation (OpenPose) [15], person segmentation [17], and plane detection [53] models to obtain inputs and initialization for our test-time optimization. We evaluate our method in Sec. 5.5 of the main paper on 6 scenes (scenes 5, 7, 10, 11, 13, and 14) containing 2 people, which totals about 1800 evaluation frames. From the annotated 3D objects, we fit a ground plane which is used to compute plausibility metrics.

PROX [33] is a large-scale dataset of RGB-D videos captured at 30 Hz containing person-scene interactions in a variety of environments with light to medium occlusions. We use a subset of the qualitative part of the dataset to evaluate the plausibility of our method’s estimations. The data does not have pose annotations, but does contain the scanned scene mesh to which we fit a ground plane for plausibility evaluation. We obtain 2D pose, person masks, and ground plane initialization in the same way as for i3DB. We evaluate in Sec. 5.5 of the main paper on all videos from 4 chosen scenes (N3Office, N3Library, N0Sofa, and MPH1Library) that tend to have more dynamic motions and occlusions. In total, these scenes contain 12 unique people and about 19 minutes of video.

e.2 Baselines and Evaluation Metrics

Motion Prior Baselines

To be usable in our whole framework (test-time optimization with SMPL), the MVAE baseline is our proposed CVAE with all ablations applied simultaneously (no delta step prediction, no contact prediction, no SMPL losses, and no learned conditional prior). Note that this differs slightly from the model as presented in [52]: the decoder is an MLP rather than a mixture-of-experts and the layer sizes are larger to provide the necessary representational capacity for training on AMASS. All ablations and MVAE are trained in the exact same way as the full model. Additionally, when used in test-time optimization we use the same energy weightings as described in Tab. 5, but with irrelevant energies removed (e.g., the No Contacts ablation does not allow the use of the contact-related energies). Note that the motion prior energy is still used with MVAE and all ablations; the only thing that changes is the prior distribution within it.

Motion Estimation Baselines

The VPoser-t baseline is exactly the initialization phase of our proposed test-time optimization; we use the weightings in Tab. 5.

The PROX-RGB baseline fits the neutral SMPL-X [70] body model to the same 2D OpenPose detections used by our method. It does not use the face or hand keypoints for fitting, similar to our approach. The PROX-D baseline uses the fittings provided with the PROX dataset, which are on the known gendered SMPL-X body model and use face/hand 2D keypoints for fitting.

The VIBE baseline uses the same 2D OpenPose detections as our method in order to define bounding boxes for inference. We found this makes for a more fair comparison since the real-time trackers used in their implementation (see their GitHub repository) often fail for the medium to heavy occlusions common in our evaluation datasets.

Evaluation Metrics

In order to report occluded (Occ) and visible (Vis) positional errors separately, we must determine which joints/keypoints are occluded during evaluation. This is easily done for 3D tasks where “occlusions” are synthetically generated. For RGB data in i3DB, we use the person segmentation mask obtained with DeepLabv3 [17] to determine if a ground truth 3D joint is visible after projecting it to the camera.
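The visibility check can be sketched as follows; the function name, the intrinsics, and the nonzero-equals-person mask convention are our assumptions:

```python
import numpy as np

def joint_visible(joint_3d, K, mask):
    """Project a 3D joint (camera coordinates, z forward) with
    intrinsics K and look up the person segmentation mask: the joint
    counts as visible only if it lands on a person pixel."""
    if joint_3d[2] <= 0:
        return False  # behind the camera
    uvw = K @ joint_3d
    u = int(round(uvw[0] / uvw[2]))
    v = int(round(uvw[1] / uvw[2]))
    h, w = mask.shape
    if not (0 <= u < w and 0 <= v < h):
        return False  # projects outside the image
    return bool(mask[v, u])

# Toy intrinsics and a small square person mask for illustration.
K = np.array([[500.0, 0.0, 32.0],
              [0.0, 500.0, 32.0],
              [0.0, 0.0, 1.0]])
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:40] = True

vis = joint_visible(np.array([0.0, 0.0, 1.0]), K, mask)   # on the mask
occ = joint_visible(np.array([1.0, 0.0, 1.0]), K, mask)   # off-image
```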

For a joint