1 Introduction
As humans, we are constantly moving in, interacting with, and manipulating the world around us. Applications such as action recognition [95, 96] or holistic dynamic indoor scene understanding [19] therefore require accurate perception of 3D human pose, shape, motion, contacts, and interaction. Extensive previous work has focused on estimating 2D or 3D human pose [15, 63, 64], shape [70, 31, 80], and motion [46] from videos. These are challenging problems due to the large space of articulations, body shapes, and appearance variations. Even the best methods struggle to accurately capture a wide variety of motions from varying input modalities, produce noisy or overly-smoothed motions (especially at ground contact, e.g., footskate), and struggle in the presence of significant occlusions (e.g., walking behind a couch as in Fig. 1).
We focus on the problem of building a robust human motion model that can address these challenges. To date, most motion models directly represent sequences of likely poses, e.g., in PCA space [68, 93, 83] or via future-predicting autoregressive processes [88, 92, 74]. However, purely pose-based predictions either make modeling environment interactions and generalization beyond training poses difficult, or quickly diverge from the space of realistic motions. On the other hand, explicit physical dynamics models [76, 51, 82, 75, 14, 13] are resource intensive and require knowledge of unobservable physical quantities. While generative models potentially offer the required flexibility, building an expressive, generalizable, and robust model for realistic 3D human motions remains an open problem.
To address this, we introduce a learned, autoregressive, generative model that captures the dynamics of 3D human motion, i.e., how pose changes over time. Rather than describing likely poses, the Human Motion Model for Robust Estimation (HuMoR) models a probability distribution of possible pose transitions, formulated as a conditional variational autoencoder [85]. Though not explicitly physics-based, its components correspond to a physical model: the latent space can be interpreted as generalized forces, which are inputs to a dynamics model with numerical integration (the decoder). Moreover, ground contacts are explicitly predicted and used to constrain pose estimation at test time.
After training on the large AMASS motion capture dataset [61], we use HuMoR as a motion prior at test time for 3D human perception from noisy and partial observations across different input modalities, such as RGB(D) video and 2D or 3D joint sequences, as illustrated in Fig. 1 (left). In particular, we introduce a robust test-time optimization (TestOpt) strategy that interacts with HuMoR to estimate the parameters of 3D motion, body shape, the ground plane, and contact points, as shown in Fig. 1 (middle/right). This interaction happens in two ways: (i) the motion is parameterized in the latent space of HuMoR, in addition to the physical space of ground/contacts and shape; (ii) HuMoR priors regularize the optimization towards the space of plausible motions. Together, these allow our model to be seamlessly integrated into TestOpt, leading to a robust temporal pose estimation framework.
Comprehensive evaluations reveal that our method surpasses the state-of-the-art on a variety of visual inputs in terms of accuracy and physical plausibility of motions under partial and severe occlusions. We further demonstrate that our motion model generalizes to diverse motions and body shapes on common generative tasks like sampling and future prediction. In a nutshell, our contributions are:


HuMoR, a generative 3D human motion prior modeled by a novel conditional VAE that enables expressive and general motion reconstruction and generation,

A robust test-time optimization approach that uses HuMoR as a strong motion prior while jointly solving for pose, body shape, and ground plane / contacts,

The capability to operate on a variety of inputs, such as RGB(D) video and 2D/3D joint position sequences, to yield accurate and plausible motions and contacts, exemplified through extensive evaluations.
Our work, more generally, suggests that neural nets for dynamics problems can benefit from architectures that model transitions, allowing control structures that emulate classical physical formulations.
2 Related Work
Much progress has been made on building methods to recover 3D joint locations [73, 64, 63] or parameterized 3D pose and shape (e.g., SMPL [57]) from observations [94]. We focus primarily on motion and shape estimation.
Learning-Based Estimation
Deep learning approaches have shown success in regressing 3D shape and pose from a single image [47, 41, 71, 30, 29, 103, 20]. This has led to developments in predicting motion (pose sequences) and shape directly from RGB video [42, 105, 81, 87, 22]. Most recently, VIBE [46] uses adversarial training to encourage plausible outputs from a conditional recurrent motion generator. MEVA [60] maps a fixed-length image sequence to the latent space of a pretrained motion autoencoder. These methods are fast and produce accurate root-relative joint positions for video, but the motion is globally inconsistent and they struggle to generalize, e.g., under severe occlusions. Other works have addressed occlusions, but only on static images [9, 106, 77, 26]. Our approach resolves difficult occlusions in video and other modalities by producing plausible and expressive motions.
Optimization-Based Estimation
One may directly optimize to more accurately fit observations (images or 2D pose estimates [15]) using human body models [24, 6, 10]. SMPLify [10] uses the SMPL model [57] to fit pose and shape parameters to 2D keypoints in an image using priors on pose and shape. Later works consider body silhouettes [49] and use a learned variational pose prior [70]. Optimization over motion sequences has been explored by several works [5, 40, 55, 104, 100] that apply simple smoothness priors over time. These produce reasonable estimates when the person is fully visible, but with unrealistic dynamics, e.g., overly smooth motions and footskate.
Some works employ human-environment interaction and contact constraints to improve shape and pose estimation [33, 55, 34] by assuming scene geometry is given. iMapper [66] recovers both 3D joints and a primitive scene representation from RGB video based on interactions, by retrieving motions that may differ from observations. In contrast, our approach optimizes for pose and shape using an expressive generative model that produces more natural motions than prior work, with realistic ground contact.
Human Motion Models
Early sophisticated motion models for pose tracking used a variety of approaches, including mixtures-of-Gaussians [38], linear embeddings of periodic motion [68, 93, 83], nonlinear embeddings [23], and nonlinear autoregressive models [88, 97, 92, 74]. These methods operate in pose space and are limited to specific motions. Models based on physics can potentially generalize more accurately [76, 51, 82, 75, 14, 13], while also estimating global pose and environmental interactions. However, general-purpose physics-based models are difficult to learn, computationally intensive at test time, and often assume full-body visibility to detect contacts [76, 51, 82].
Many motion models have been learned for computer animation [12, 48, 79, 50, 54, 37, 86], including recent recurrent and autoregressive models [32, 28, 35, 101, 52]. These often focus on visual fidelity for a small set of characters and periodic locomotions. Works have begun exploring the generation of more general motions and body shapes [107, 72, 3, 21], but in the context of short-term future prediction. HuMoR is most similar to Motion VAE [52]; however, we make crucial contributions to enable generalization to unseen, non-periodic motions on novel body shapes.
3 HuMoR: 3D Human Dynamics Model
The goal of our work is to build an expressive and generalizable generative model of 3D human motion learned from real human motions, and to show that it can be used for robust test-time optimization (TestOpt) of pose and shape. In this section, we first describe the model, HuMoR.
State Representation
We represent the state of a moving person as a matrix composed of the root translation r, root orientation Φ in axis-angle form, body pose joint angles Θ, and joint positions J:
(1)  x_t = [ r_t  ṙ_t  Φ_t  Φ̇_t  Θ_t  J_t  J̇_t ]
where ṙ_t, Φ̇_t, and J̇_t denote the root and joint velocities, respectively. Part of the state, (r, Φ, Θ), parameterizes the SMPL+H body model [78], which is a differentiable function M(r, Φ, Θ; β) that maps to body mesh vertices and joints given shape parameters β. Our over-parameterization allows for two ways to recover the joints: (i) explicitly from J in the state, or (ii) implicitly through the SMPL+H map M.
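For concreteness, the state of Eq. 1 can be sketched as a flat vector; the joint counts follow SMPL+H (21 body joints plus the root, so 22 joint positions), but the exact packing and the resulting 207-D size are our illustrative assumptions, not the paper's specification.

```python
import numpy as np

# Illustrative packing of the state in Eq. 1 (dimensions are assumptions).
NUM_ANGLE_JOINTS = 21  # body pose joint angles Θ (axis-angle, 3 each)
NUM_JOINTS = 22        # root + body joint positions J (3 each)

COMPONENTS = [
    ("trans", 3), ("trans_vel", 3),              # root translation r, ṙ
    ("root_orient", 3), ("root_orient_vel", 3),  # axis-angle Φ, Φ̇
    ("pose_body", 3 * NUM_ANGLE_JOINTS),         # joint angles Θ
    ("joints", 3 * NUM_JOINTS),                  # joint positions J
    ("joints_vel", 3 * NUM_JOINTS),              # joint velocities J̇
]
STATE_DIM = sum(d for _, d in COMPONENTS)

def split_state(x):
    """Unpack a flat state vector into its named components."""
    out, i = {}, 0
    for name, dim in COMPONENTS:
        out[name] = x[i:i + dim]
        i += dim
    return out

parts = split_state(np.zeros(STATE_DIM))
```

The over-parameterization is visible here: `joints` duplicates information recoverable from `pose_body` via the body model, which later enables the skeleton consistency losses.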
Latent Variable Dynamics Model
We are interested in modeling the probability of a time sequence of states
(2)  p_θ(x_0, x_1, …, x_T) = p(x_0) ∏_{t=1}^{T} p_θ(x_t | x_{t−1})
where each state is assumed to depend on only the previous one and θ are learned parameters. Then p_θ(x_t | x_{t−1}) must capture the plausibility of a transition.
We propose a conditional variational autoencoder (CVAE) that formulates the motion as a latent variable model, as shown in Fig. 2. This transition model has a nice physical interpretation, which we detail later, and similar models have shown encouraging results for animation [52]. Following the original CVAE derivation [85], our model contains two main components. First, conditioned on the previous state x_{t−1}, the distribution over possible latent variables z_t is described by a learned conditional prior:
(3)  p_θ(z_t | x_{t−1}) = N(z_t; μ_θ(x_{t−1}), σ_θ(x_{t−1}))
which parameterizes a Gaussian distribution with diagonal covariance via a neural network. Intuitively, the latent variable z_t represents the transition to x_t and should therefore have different distributions given different x_{t−1}. For example, an idle person has a large variation of possible next states, while a person in mid-air is on a nearly deterministic trajectory. Learning the conditional prior significantly improves the ability of the CVAE to generalize to diverse motions and empirically stabilizes both training and TestOpt.
Second, conditioned on z_t and x_{t−1}, the decoder produces two outputs: the change in state Δ_θ(z_t, x_{t−1}) and person-ground contacts c_t. The change in state defines the output distribution p_θ(x_t | z_t, x_{t−1}) through
(4)  x_t = x_{t−1} + Δ_θ(z_t, x_{t−1})
We find the additive update improves predictive accuracy compared to direct next-step prediction. The person-ground contact c_t is the probability that each of 8 body joints (left and right toes, heels, knees, and hands) is in contact with the ground at time t. Contacts are not part of the input to the conditional prior, only an output of the decoder. The contacts enable environmental constraints in TestOpt and also encourage more physics-aware learning.
The complete probability model for a transition is then:
(5)  p_θ(x_t, z_t | x_{t−1}) = p_θ(x_t | z_t, x_{t−1}) p_θ(z_t | x_{t−1})
Given an initial state x_0, one can sample a motion sequence by alternating between sampling z_t ~ p_θ(z_t | x_{t−1}) and sampling x_t ~ p_θ(x_t | z_t, x_{t−1}), from t = 1 to T. This model parallels a conventional stochastic physical model. The conditional prior can be seen as a controller, producing “forces” z_t as a function of the state x_{t−1}, while the decoder acts like a combined physical dynamics model and Euler integrator of generalized position and velocity in Eq. 4.
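To make the alternating sampling concrete, here is a minimal numpy sketch of rollout. The linear `prior` and `decoder` are toy stand-ins for the learned networks, and all dimensions and scales are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, LATENT_DIM = 8, 4  # toy sizes, not the paper's dimensions

# Random linear maps stand in for the learned prior and decoder networks.
W_mu = rng.normal(scale=0.1, size=(LATENT_DIM, STATE_DIM))
W_dec = rng.normal(scale=0.1, size=(STATE_DIM, LATENT_DIM + STATE_DIM))

def prior(x_prev):
    """Conditional prior p(z_t | x_{t-1}) -> (mean, std), as in Eq. 3."""
    return W_mu @ x_prev, 0.1 * np.ones(LATENT_DIM)

def decoder(z, x_prev):
    """Predicts the change in state used by the additive update of Eq. 4."""
    return W_dec @ np.concatenate([z, x_prev])

def sample_motion(x0, T):
    """Alternate z_t ~ p(z_t | x_{t-1}) and x_t = x_{t-1} + delta."""
    states = [x0]
    for _ in range(T):
        mu, sigma = prior(states[-1])
        z = mu + sigma * rng.standard_normal(LATENT_DIM)
        states.append(states[-1] + decoder(z, states[-1]))
    return np.stack(states)

traj = sample_motion(np.zeros(STATE_DIM), T=30)  # (T+1, STATE_DIM)
```

The controller/integrator analogy is visible in the loop: `prior` produces the "force" z_t from the current state, and the decoder plus addition plays the role of dynamics and Euler integration.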
Our model has connections to Motion VAE (MVAE) [52], which has recently shown promising results for single-character locomotion by also using a VAE for p_θ(x_t | x_{t−1}); however, we find that directly applying it for estimation does not give good results (Sec. 5). We overcome this by learning a conditional prior, modeling the change in state and contacts, and ensuring consistency between the state and body model (Sec. 3.1).
Rollout Function
We use our model to define a deterministic rollout function, which is key to TestOpt. Given an initial state x_0 and a sequence of latent transitions z_{1:T}, we define a function that deterministically maps the motion “parameters” (x_0, z_{1:T}) to the resulting state x_T at time T. This is done through autoregressive rollout, which decodes and integrates at each timestep.
Initial State GMM
We model the initial state distribution p(x_0) in Eq. 2 with a Gaussian mixture model (GMM); its training is described below.
3.1 Training
Our CVAE is trained using pairs of (x_{t−1}, x_t). We consider the usual variational lower bound:
(6)  log p_θ(x_t | x_{t−1}) ≥ E_{q_φ}[ log p_θ(x_t | z_t, x_{t−1}) ] − KL( q_φ(z_t | x_t, x_{t−1}) || p_θ(z_t | x_{t−1}) )
The expectation term measures the reconstruction error of the decoder. The encoder (approximate posterior) q_φ(z_t | x_t, x_{t−1}) is introduced for training and parameterizes a Gaussian distribution with diagonal covariance. The KL divergence regularizes its output to be near the conditional prior. Therefore, we seek the parameters θ, φ that minimize the loss function
(7)  L = L_rec + w_KL L_KL + L_reg
over all training pairs in our dataset, where L_rec + w_KL L_KL is the (negated and weighted) lower bound of Eq. 6 with KL weight w_KL, and L_reg contains additional regularizers.
For a single training pair (x_{t−1}, x_t), the reconstruction loss is computed as L_rec = ||x_t − x̂_t||² from the decoder output x̂_t, with ẑ_t ~ q_φ(z_t | x_t, x_{t−1}). Gradients are backpropagated through this sample using the reparameterization trick [44]. The regularization loss contains two terms: L_reg = L_SMPL + L_contact. The SMPL term uses the output of the body model with the estimated parameters and ground truth shape β:
(8)  L_SMPL = || M(r̂_t, Φ̂_t, Θ̂_t; β) − M(r_t, Φ_t, Θ_t; β) ||²
This loss encourages consistency between regressed joints and those of the body model. The contact loss L_contact contains two terms. The first supervises ground contact classification with a typical binary cross entropy; the second regularizes joint velocities to be consistent with contacts:
(9)  L_vel = Σ_i ĉ_t^i || J̇_t^i ||²
with ĉ_t^i the predicted probability that joint i is in ground contact. Loss weights are set empirically.
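A sketch of the per-pair training objective, assuming diagonal Gaussians for the encoder and conditional prior; the KL weight value is illustrative and the SMPL/contact regularizers are omitted.

```python
import numpy as np

def kl_diag_gaussians(mu_q, std_q, mu_p, std_p):
    """KL(q || p) between diagonal Gaussians: the KL term of Eq. 6, where
    p is the learned conditional prior rather than a fixed N(0, I)."""
    return np.sum(
        np.log(std_p / std_q)
        + (std_q**2 + (mu_q - mu_p)**2) / (2.0 * std_p**2)
        - 0.5
    )

def cvae_step_loss(x_t, x_hat, enc, pri, w_kl=1e-4):
    """Per-pair loss: squared reconstruction error plus weighted KL.
    enc and pri are (mean, std) pairs from the encoder and the conditional
    prior; w_kl is an illustrative value, and L_reg is omitted here."""
    rec = np.sum((x_t - x_hat) ** 2)
    return rec + w_kl * kl_diag_gaussians(*enc, *pri)

# Reparameterization trick: z = mu + std * eps keeps the sampled latent
# differentiable with respect to the encoder outputs.
rng = np.random.default_rng(0)
mu, std = np.zeros(4), np.ones(4)
z = mu + std * rng.standard_normal(4)
```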
The initial state GMM is trained separately with expectation-maximization on all available states in the same dataset used to train the CVAE.
Implementation Details
To ease learning and improve generalization, our model operates in an aligned canonical coordinate frame at each step. All networks are 4- or 5-layer MLPs with ReLU activations and group normalization [99].
A common difficulty in training VAEs is posterior collapse [59], in which the learned latent encoding is effectively ignored by the decoder. This problem is exacerbated in CVAEs, since the decoder receives additional conditioning [52, 85]. In addition to linearly annealing the KL weight [11], we found the conditional prior important to combat collapse. Following [52], we also use scheduled sampling [8] in training to enable long-term generation by making the model robust to its own errors. Additional implementation details are available in the supplementary material.
4 Joint Optimization of Motion and Shape
We next use the space of motion learned by HuMoR as a prior in TestOpt to recover pose and shape from noisy and partial observations while ensuring plausibility.
4.1 Optimization Variables
Given a sequence of observations, either as 2D/3D joints, a 3D point cloud, or 3D keypoints, we seek the shape β and a sequence of SMPL pose parameters that describe the underlying motion being observed. We parameterize the optimized motion using our CVAE by the initial state x_0 and a sequence of latent transitions z_{1:T}. The state x_t at any time t is then determined through model rollout using the decoder, as previously detailed. Compared to directly optimizing SMPL parameters [5, 10, 40], this motion representation naturally encourages plausibility and is compact in the number of variables. To obtain the transformation between the canonical coordinate frame in which our CVAE is trained and the observation frame used for optimization, we additionally optimize the ground plane g of the scene. Altogether, we simultaneously optimize the initial state x_0, a sequence of latent variables z_{1:T}, ground g, and shape β. We assume a static camera whose intrinsics are known.
4.2 Objective & Optimization
Our optimization objective can be formulated as a maximum a-posteriori (MAP) estimate (see supplementary for the full derivation), resulting in the following objective, which seeks a motion that is plausible under our generative model while closely matching observations:
(10)  min_{x_0, z_{1:T}, g, β}  E_mot + E_data + E_reg
We next detail each of these terms: the motion prior, data, and regularization energies. In the following, λ are weights that determine the contribution of each term.
Motion Prior
This energy measures the likelihood of the latent transitions and the initial state under the HuMoR CVAE and GMM. It is E_mot = E_CVAE + E_init, where
(11)  E_CVAE = − Σ_{t=1}^{T} log p_θ(z_t | x_{t−1})
uses the learned conditional prior, and E_init = − log p(x_0) uses the initial state GMM.
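Under a diagonal Gaussian conditional prior, the per-step motion prior energy is a plain negative log-likelihood; a minimal sketch:

```python
import numpy as np

def gaussian_nll(z, mu, std):
    """-log N(z; mu, diag(std^2)): the per-step motion prior cost."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * std**2) + ((z - mu) / std) ** 2)

def e_cvae(z_seq, mu_seq, std_seq):
    """E_CVAE of Eq. 11: negative log prior likelihood summed over steps,
    with (mu, std) produced by the conditional prior at each step."""
    return sum(gaussian_nll(z, m, s) for z, m, s in zip(z_seq, mu_seq, std_seq))
```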
Data Term
This term is the only modality-dependent component of our approach, requiring different losses for different inputs: 3D joints, 2D joints, and 3D point clouds. We next specify the individual terms for each modality. All operate on the SMPL joints J^SMPL or mesh vertices V obtained through the body model M using the current SMPL parameters, which are contained in x_t, and shape β. In the simplest case, the observations are 3D joint positions Ĵ_t (or keypoints with known correspondences) and our energy is
(12)  E_data = λ_3D Σ_t Σ_j || J^SMPL_{t,j} − Ĵ_{t,j} ||²
For 2D joint positions p̂_{t,j}, each with a detection confidence σ_{t,j}, we use a re-projection loss
(13)  E_data = λ_2D Σ_t Σ_j σ_{t,j} ρ( Π(J^SMPL_{t,j}) − p̂_{t,j} )
with ρ the robust Geman-McClure function [10, 25] and Π the pinhole projection. If an estimated person segmentation mask is available, we use it to ignore spurious 2D joints. Finally, if the observation is a 3D point cloud P_t obtained from a depth map roughly masked around the person of interest, we use the mesh vertices V_t to compute
(14)  E_data = λ_pt Σ_t Σ_{p ∈ P_t} w_p min_{v ∈ V_t} || v − p ||²
where w_p is a robust bisquare weight [7] computed based on the Chamfer distance term.
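Sketches of the two robust functions mentioned above; the Geman-McClure scale and the bisquare cutoff are illustrative defaults, not values from the paper.

```python
import numpy as np

def geman_mcclure(residual, sigma=100.0):
    """Robust Geman-McClure penalty for 2D re-projection errors (Eq. 13):
    quadratic near zero, saturating at sigma**2 for large residuals.
    The scale sigma is an illustrative default."""
    sq = residual**2
    return (sigma**2 * sq) / (sq + sigma**2)

def bisquare_weight(residual, c=4.685):
    """Tukey bisquare weight for the point cloud term (Eq. 14): close to 1
    for small residuals, exactly 0 beyond the cutoff c, which downweights
    outlier points such as background depth pixels."""
    r = np.abs(residual) / c
    return np.where(r < 1.0, (1.0 - r**2) ** 2, 0.0)
```

Saturation is what makes these terms robust: a grossly wrong 2D detection contributes a bounded penalty, and a far-away depth point gets zero weight, instead of either one dominating the least-squares fit.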
Regularizers
The additional regularization consists of four terms. The first two encourage rolled-out motions from the CVAE to be plausible even when the initial state is far from the optimum (early in optimization). The skeleton consistency term penalizes, at each step of the rollout, (i) differences between the joints J_t directly predicted by the decoder and the SMPL joints J^SMPL_t, and (ii) deviations in bone length over time, where the second summation uses bone lengths computed from J_t at each step. The second regularizer ensures consistency between predicted CVAE contacts, the motion, and the environment: the velocity of each joint is penalized in proportion to ĉ_t^i, the contact probability output from the model for joint i. The contact height term, weighted by λ_hgt, ensures the up-axis component of contacting joints is within a distance δ of the floor when in contact (since joints are inside the body) in the canonical frame.
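A minimal sketch of the contact-consistency idea: joints predicted to be in contact should be static and near the floor. The threshold `delta` and the uniform weighting are assumptions for illustration.

```python
import numpy as np

def contact_energy(joint_pos, joint_vel, contact_prob, delta=0.08, up=2):
    """Contacting joints should not move and should lie within `delta`
    of the floor (height 0 along axis `up` in the canonical frame).
    joint_pos, joint_vel: (num_joints, 3); contact_prob: (num_joints,)."""
    vel_term = np.sum(contact_prob * np.sum(joint_vel**2, axis=-1))
    height = np.abs(joint_pos[:, up])
    height_term = np.sum(contact_prob * np.maximum(height - delta, 0.0) ** 2)
    return vel_term + height_term
```

Weighting by the predicted contact probability makes the penalty soft: joints the model is unsure about are only weakly constrained.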
Initialization & Optimization
We initialize the temporal SMPL parameters and shape with a two-stage optimization using E_data and E_reg along with two additional regularization terms. The first is a pose prior ||z_t^pose||², where z_t^pose is the body joint angles represented in the latent space of the VPoser model [70, 33]. The second is a smoothness term that smooths 3D joint positions over time. In this initialization, the global translation and orientation are optimized first, followed by full pose and shape. Finally, the initial latent sequence z_{1:T} is computed through inference with the learned CVAE encoder q_φ. Our optimization is implemented in PyTorch [69] using L-BFGS and autograd; with batching, an RGB video takes about 5.5 minutes to fit. We provide further details in the supplementary material.
5 Experimental Results
Table 1: Future prediction (ADE, FDE) and diversity (APD); Contact is contact classification accuracy.

Model  Contact  ADE  FDE  APD
MVAE [52]  –  25.8  50.6  85.4
HuMoR  0.88  21.5  42.1  94.9
HuMoR (Qual)  0.88  22.0  46.3  100.0
We evaluate HuMoR on (i) generative sampling tasks and (ii) as a prior in TestOpt to estimate motion from 3D and RGB(D) inputs. We encourage viewing the supplementary video to appreciate the qualitative improvement of our approach. Additional dataset and experiment details are available in the supplementary document.
5.1 Datasets
AMASS [61] is a large motion capture database containing diverse motions and body shapes on the SMPL body model. We subsample the dataset to 30 Hz and use the recommended training split to train the CVAE and initial state GMM in HuMoR. We evaluate on the held out Transitions and HumanEva [84] subsets (Sec. 5.3 and 5.4).
i3DB [66] contains RGB videos of person-scene interactions involving medium to heavy occlusions. It provides annotated 3D joint positions and a primitive 3D scene reconstruction, which we use to fit a ground plane for computing plausibility metrics. We run off-the-shelf 2D pose estimation [15], person segmentation [17], and plane detection [53] models to obtain inputs for our optimization.
PROX [33] contains RGBD videos of people interacting with indoor environments. We use a subset of the qualitative data to evaluate plausibility metrics using a floor plane fit to the provided ground truth scene mesh. We obtain 2D pose, person masks, and ground plane initialization in the same way as done for i3DB.
5.2 Baselines and Evaluation Metrics
Motion Prior Baselines
We ablate the proposed CVAE to analyze its core components: No Delta directly predicts the next state from the decoder rather than the change in state, No Contacts does not classify ground contacts, No SMPL does not use SMPL regularization in training, and Standard Prior uses a fixed N(0, I) prior rather than our learned conditional prior. Ablating all of these together recovers MVAE [52].
Motion Estimation Baselines
VPoser-t is the initialization phase of our optimization. It uses VPoser [70] and 3D joint smoothing, similar to previous works [5, 40, 104]. PROX-RGB and PROX-D [33] are optimization-based methods that operate on individual frames of RGB and RGBD videos, respectively. Both assume the full scene mesh is given to enforce contact and penetration constraints. VIBE [46] is a recent learned method to recover shape and pose from video.
Error Metrics
3D positional errors are measured on joints, keypoints, or mesh vertices (Vtx) and computed as the global mean per-point position error unless otherwise specified. We report positional errors for all (All), occluded (Occ), and visible (Vis) observations separately. Finally, we report the binary classification accuracy of the 8 person-ground contacts (Contact) predicted by HuMoR.
Plausibility Metrics
We use additional metrics to measure qualitative motion characteristics that joint errors cannot capture. Smoothness is evaluated by mean per-joint accelerations (Accel) [42]. Another important indicator of plausibility is ground penetration [76]. We use the true ground plane to compute the frequency (Freq) of foot-floor penetrations: the fraction of frames, for both the left and right toe joints, that penetrate more than a threshold. We measure frequency at 0, 3, 6, 9, 12, and 15 cm thresholds and report the mean. We also report mean penetration distance (Dist), where non-penetrating frames contribute a distance of 0 to make values comparable across differing frequencies.
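The penetration metrics can be computed from toe-joint heights as follows; the exact threshold list is our assumption where the text is incomplete.

```python
import numpy as np

def penetration_metrics(toe_heights, thresholds_cm=(0, 3, 6, 9, 12, 15)):
    """Ground penetration metrics from per-frame toe heights (meters,
    floor at 0). Freq: fraction of penetrating frames averaged over the
    thresholds; Dist: mean penetration depth, counting non-penetrating
    frames as 0. The 15 cm upper threshold is our assumption."""
    pen = np.maximum(-np.asarray(toe_heights), 0.0)  # depth below floor
    freq = float(np.mean([np.mean(pen > t / 100.0) for t in thresholds_cm]))
    return freq, float(np.mean(pen))
```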
Table 2: Fitting to 3D observations. Positional errors on joints (Vis / Occ / All / Legs) and mesh vertices (Vtx), contact accuracy, acceleration, and ground penetration.

Method  Input  Vis  Occ  All  Legs  Vtx  Contact  Accel  Freq  Dist
VPoser-t  Occ Keypoints  0.67  20.76  9.22  21.08  7.95  –  5.71  16.77%  2.28
MVAE [52]  Occ Keypoints  2.39  19.15  9.52  16.86  8.90  –  7.12  3.15%  0.30
HuMoR (Ours)  Occ Keypoints  1.46  17.40  8.24  15.42  7.56  0.89  5.38  3.31%  0.26
VPoser-t  Noisy Joints  –  –  3.67  4.47  4.98  –  4.61  1.35%  0.07
MVAE [52]  Noisy Joints  –  –  2.68  3.21  4.42  –  6.50  1.75%  0.11
HuMoR (Ours)  Noisy Joints  –  –  2.27  2.61  3.55  0.97  5.23  1.18%  0.05
5.3 Generative Model Evaluation
We first evaluate HuMoR as a standalone generative model and show improved generalization to unseen motions and bodies compared to MVAE for two common tasks (see Table 1): future prediction and diverse sampling. We use AMASS sequences and start generation from the first step. Results are shown for HuMoR and a modified HuMoR (Qual), which uses the SMPL joints J^SMPL as input to each step during rollout instead of the regressed joints, thereby enforcing skeleton consistency. This version produces qualitatively superior results for generation but is too expensive to use during TestOpt.
For prediction, we report average displacement error (ADE) and final displacement error (FDE) [102], which measure mean joint errors over all steps and at the final step, respectively. We sample 50 motions for each initial state, and the one with the lowest ADE is considered the prediction. For diversity, we sample 50 motions and compute the average pairwise distance (APD) [4], the mean joint distance between all pairs of samples.
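The three metrics can be sketched directly: ADE/FDE compare one trajectory to ground truth, and APD measures diversity over a set of samples (shapes are `(T, J, 3)`; units follow the joint positions).

```python
import numpy as np

def ade_fde(pred, gt):
    """pred, gt: (T, J, 3) joint trajectories. ADE is the mean joint
    error over all steps; FDE is the mean joint error at the last step."""
    err = np.linalg.norm(pred - gt, axis=-1)  # (T, J)
    return float(err.mean()), float(err[-1].mean())

def apd(samples):
    """samples: (S, T, J, 3). Mean joint distance over all sample pairs."""
    S = samples.shape[0]
    dists = [np.linalg.norm(samples[i] - samples[j], axis=-1).mean()
             for i in range(S) for j in range(i + 1, S)]
    return float(np.mean(dists))
```

The best-of-50 protocol then amounts to `min(ade_fde(s, gt)[0] for s in samples)`.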
5.4 Estimation from 3D Observations
Next, we show that HuMoR also generalizes better when used in TestOpt for fitting to 3D data, and that using a motion prior is crucial to plausibly handling occlusions. AMASS sequences are used to demonstrate two key abilities: (i) fitting to partial data and (ii) denoising. For the former, TestOpt fits to 43 keypoints on the body that resemble motion capture markers; keypoints that fall below a height threshold at each timestep are “occluded”, leaving the legs unobservable at most steps. For denoising, Gaussian noise is added to the 3D joint position observations.
Tab. 2 compares to VPoser-t and to using MVAE as the motion prior during optimization rather than HuMoR. We report leg joint errors (toes, ankles, and knees), which are often occluded, separately. The right side of the table reports plausibility metrics. HuMoR gives more accurate poses, especially for occluded keypoints and leg joints. It also estimates smoother motions with fewer and less severe ground penetrations. For denoising, VPoser-t over-smooths, which gives the lowest acceleration but the least accurate motion. TestOpt with HuMoR gives inherently smooth results while still allowing the necessarily large accelerations needed to fit dynamic observations. Notably, HuMoR predicts person-ground contact with 97% accuracy even under severe noise. Qualitative results are shown in Fig. 1 and Fig. 3.
Table 3: Results on i3DB. Global and root-aligned joint errors, acceleration, and ground penetration.

Method  Global: Vis  Occ  All  Legs  Root-Aligned: Vis  Occ  All  Legs  Accel  Freq  Dist
VIBE [46]  90.05  192.55  116.46  121.61  12.06  23.78  15.08  21.65  243.36  7.98%  3.01
VPoser-t  28.33  40.97  31.59  35.06  12.77  26.48  16.31  25.60  4.46  9.28%  2.42
MVAE [52]  37.54  50.63  40.91  44.42  16.00  28.32  19.17  26.63  4.96  7.43%  1.55
No Delta  27.55  35.59  29.62  32.14  11.92  23.10  14.80  21.65  3.05  2.84%  0.58
No Contacts  26.65  39.21  29.89  35.73  12.24  23.36  15.11  22.25  2.43  5.59%  1.70
No SMPL  31.09  43.67  34.33  36.84  12.81  25.47  16.07  23.54  3.21  4.12%  1.31
Standard Prior  77.60  146.76  95.42  99.01  18.67  39.40  24.01  34.02  5.98  8.30%  6.47
HuMoR (Ours)  26.00  34.36  28.15  31.26  12.02  21.70  14.51  20.74  2.43  2.12%  0.68
Table 4: Plausibility on PROX from RGB (top) and RGBD (bottom).

Method  Input  Accel  Freq  Dist
VIBE [46]  RGB  86.06  23.46%  4.71
PROX-RGB [33]  RGB  196.07  2.55%  0.32
VPoser-t  RGB  3.14  13.38%  2.82
HuMoR (Ours)  RGB  1.73  9.99%  1.56
PROX-D [33]  RGBD  46.59  8.95%  1.19
VPoser-t  RGBD  3.27  10.66%  2.18
HuMoR (Ours)  RGBD  1.61  5.19%  0.85
5.5 Estimation from RGB(D) Observations
Finally, we show that TestOpt with HuMoR can be applied to real-world RGB and RGBD observations, and outperforms baselines on positional and plausibility metrics, especially on partial and noisy data. We use 90-frame clips from i3DB [66] and PROX [33]. Tab. 3 shows results on i3DB, which affords quantitative 3D joint evaluation. The top half compares to baseline estimation methods; the bottom uses ablations of HuMoR in TestOpt rather than the full model. Mean per-joint position errors are reported for global joint positions and after root alignment.
As seen in Tab. 3, VIBE gives locally accurate predictions for visible joints, but large global errors and unrealistic accelerations due to occlusions and temporal inconsistency (see Fig. 5). VPoser-t gives reasonable global errors but suffers frequent penetrations, as shown for sitting in Fig. 5. Using MVAE or ablations of HuMoR as the motion prior in TestOpt fails to effectively generalize to real-world data and performs worse than the full model. The conditional prior and the SMPL regularization have the largest impact, while performance even without using contacts still outperforms the baselines.
The top half of Tab. 4 evaluates plausibility on additional RGB results from PROX, compared to VIBE and PROX-RGB. Since PROX-RGB uses the scene mesh as input to enforce environment constraints, it is a very strong baseline and its performance on penetration metrics is expectedly good. HuMoR comparatively increases penetration frequency, since it only gets a rough ground plane as initialization, but gives much smoother motions.
The bottom half of Tab. 4 shows results of fitting to RGBD for the same PROX data, which uses both the 2D joint and 3D point cloud data terms in TestOpt. This improves the performance of HuMoR, which slightly outperforms PROX-D; the latter is less robust to issues with 2D joint detections and 3D point noise, which cause large errors. Qualitative examples are in Fig. 1 and Fig. 4.
6 Discussion
We have introduced HuMoR, a learned generative model of 3D human motion leveraged during testtime optimization to robustly recover pose and shape from 3D, RGB, and RGBD observations. We have demonstrated that the key components of our model enable generalization to novel motions and body shapes for both generative tasks and downstream optimization. Compared to strong learning and optimizationbased baselines, HuMoR excels at estimating plausible motion under heavy occlusions, and simultaneously produces consistent ground plane and contact outputs.
Limitations & Future Work
HuMoR leaves ample room for future work. The static-camera and ground-plane assumptions are reasonable for indoor scenes, but true in-the-wild operation demands methods that handle dynamic cameras and complex terrain. Our rather simplistic contact model should be upgraded to capture scene-person interactions for improved motion and scene perception. Lastly, we plan to learn motion estimation directly from partial observations and to sample multiple plausible motions rather than relying on a single local minimum.
Acknowledgments
This work was supported by the Toyota Research Institute under the University 2.0 program, a grant from the Samsung GRO program, a grant from the Ford-Stanford Alliance, a Vannevar Bush faculty fellowship, and NSF grant IIS-1763268. Toyota Research Institute (“TRI”) provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
References
 [1] Advanced Computing Center for the Arts and Design. ACCAD MoCap Dataset.
 [2] Ijaz Akhter and Michael J. Black. Pose-conditioned joint angle limits for 3D human pose reconstruction. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1446–1455, June 2015.
 [3] Emre Aksan, Manuel Kaufmann, and Otmar Hilliges. Structured prediction helps 3d human motion modelling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7144–7153, 2019.
 [4] Sadegh Aliakbarian, Fatemeh Sadat Saleh, Mathieu Salzmann, Lars Petersson, and Stephen Gould. A stochastic conditioning scheme for diverse human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5223–5232, 2020.
 [5] Anurag Arnab, Carl Doersch, and Andrew Zisserman. Exploiting temporal context for 3d human pose estimation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3395–3404, 2019.
 [6] Andreas Baak, Meinard Müller, Gaurav Bharaj, HansPeter Seidel, and Christian Theobalt. A datadriven approach for realtime full body pose reconstruction from a depth camera. In Consumer Depth Cameras for Computer Vision, pages 71–98. Springer, 2013.
 [7] Albert E Beaton and John W Tukey. The fitting of power series, meaning polynomials, illustrated on bandspectroscopic data. Technometrics, 16(2):147–185, 1974.
 [8] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. arXiv preprint arXiv:1506.03099, 2015.
 [9] Benjamin Biggs, Sébastien Ehrhadt, Hanbyul Joo, Benjamin Graham, Andrea Vedaldi, and David Novotny. 3d multibodies: Fitting sets of plausible 3d human models to ambiguous image data. arXiv preprint arXiv:2011.00980, 2020.
 [10] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Computer Vision – ECCV 2016, Lecture Notes in Computer Science. Springer International Publishing, Oct. 2016.
 [11] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
 [12] Matthew Brand and Aaron Hertzmann. Style machines. In ACM SIGGRAPH, pages 183–192, July 2000.
 [13] Marcus A. Brubaker, David J. Fleet, and Aaron Hertzmann. Physicsbased person tracking using the anthropomorphic walker. IJCV, (1), 2010.
 [14] Marcus A. Brubaker, Leonid Sigal, and David J. Fleet. Estimating contact dynamics. In Proc. ICCV, 2009.
 [15] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
 [16] Carnegie Mellon University. CMU MoCap Dataset.
 [17] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
 [18] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. arXiv preprint arXiv:1806.07366, 2018.
 [19] Yixin Chen, Siyuan Huang, Tao Yuan, Siyuan Qi, Yixin Zhu, and Song-Chun Zhu. Holistic++ scene understanding: Single-view 3d holistic scene parsing and human pose estimation with human-object interaction and physical commonsense. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8648–8657, 2019.
 [20] Vasileios Choutas, Georgios Pavlakos, Timo Bolkart, Dimitrios Tzionas, and Michael J Black. Monocular expressive body regression through body-driven attention. In European Conference on Computer Vision, pages 20–40. Springer, 2020.
 [21] Enric Corona, Albert Pumarola, Guillem Alenya, and Francesc Moreno-Noguer. Context-aware human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020.
 [22] Carl Doersch and Andrew Zisserman. Sim2real transfer learning for 3d human pose estimation: motion to the rescue. arXiv preprint arXiv:1907.02499, 2019.
 [23] Ahmed Elgammal and Chan-Su Lee. Separating style and content on a nonlinear manifold. In IEEE Conf. Comp. Vis. and Pattern Recognition, pages 478–485, 2004. Vol. 1.
 [24] Varun Ganapathi, Christian Plagemann, Daphne Koller, and Sebastian Thrun. Real time motion capture using a single time-of-flight camera. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 755–762. IEEE, 2010.
 [25] S. Geman and D. McClure. Statistical methods for tomographic image reconstruction. Bulletin of the International Statistical Institute, 52(4):5–21, 1987.
 [26] Georgios Georgakis, Ren Li, Srikrishna Karanam, Terrence Chen, Jana Košecká, and Ziyan Wu. Hierarchical kinematic human mesh recovery. In European Conference on Computer Vision, pages 768–784. Springer, 2020.
 [27] Saeed Ghorbani, Kimia Mahdaviani, Anne Thaler, Konrad Kording, Douglas James Cook, Gunnar Blohm, and Nikolaus F. Troje. MoVi: A large multipurpose motion and video dataset, 2020.
 [28] Saeed Ghorbani, Calden Wloka, Ali Etemad, Marcus A Brubaker, and Nikolaus F Troje. Probabilistic character motion synthesis using a hierarchical deep latent variable model. In Computer Graphics Forum, volume 39. Wiley Online Library, 2020.
 [29] Riza Alp Guler and Iasonas Kokkinos. Holopose: Holistic 3d human reconstruction in-the-wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10884–10894, 2019.
 [30] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7297–7306, 2018.
 [31] Marc Habermann, Weipeng Xu, Michael Zollhofer, Gerard Pons-Moll, and Christian Theobalt. Deepcap: Monocular human performance capture using weak supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5052–5063, 2020.
 [32] Ikhsanul Habibie, Daniel Holden, Jonathan Schwarz, Joe Yearsley, and Taku Komura. A recurrent variational autoencoder for human motion synthesis. In 28th British Machine Vision Conference, 2017.
 [33] Mohamed Hassan, Vasileios Choutas, Dimitrios Tzionas, and Michael J. Black. Resolving 3D human pose ambiguities with 3D scene constraints. In International Conference on Computer Vision, pages 2282–2292, 2019.
 [34] Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dimitrios Tzionas, and Michael J. Black. Populating 3d scenes by learning human-scene interaction, 2020.
 [35] Gustav Eje Henter, Simon Alexanderson, and Jonas Beskow. Moglow: Probabilistic and controllable motion synthesis using normalising flows. ACM Transactions on Graphics (TOG), 39(6):1–14, 2020.
 [36] I. Higgins, Loïc Matthey, A. Pal, C. Burgess, Xavier Glorot, M. Botvinick, S. Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
 [37] Daniel Holden, Taku Komura, and Jun Saito. Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG), 36(4):1–13, 2017.
 [38] Nicholas R. Howe, Michael E. Leventon, and William T. Freeman. Bayesian reconstruction of 3D human motion from single-camera video. In Advances in Neural Information Processing Systems 12, pages 820–826, 2000.
 [39] Ludovic Hoyet, Kenneth Ryall, Rachel McDonnell, and Carol O’Sullivan. Sleight of hand: Perception of finger motion from reduced marker sets. In Proceedings of the ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, I3D ’12, page 79–86, 2012.
 [40] Yinghao Huang, Federica Bogo, Christoph Lassner, Angjoo Kanazawa, Peter V Gehler, Javier Romero, Ijaz Akhter, and Michael J Black. Towards accurate markerless human shape and pose estimation over time. In 2017 international conference on 3D vision, pages 421–430. IEEE, 2017.
 [41] Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In Computer Vision and Pattern Recognition, 2018.
 [42] Angjoo Kanazawa, Jason Y Zhang, Panna Felsen, and Jitendra Malik. Learning 3d human dynamics from video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
 [43] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [44] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [45] Ivan Kobyzev, Simon J.D. Prince, and Marcus A. Brubaker. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
 [46] Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. Vibe: Video inference for human body pose and shape estimation. In The IEEE Conference on Computer Vision and Pattern Recognition, June 2020.
 [47] Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In Proceedings International Conference on Computer Vision (ICCV), pages 2252–2261. IEEE, Oct. 2019. ISSN: 2380-7504.
 [48] Lucas Kovar, Michael Gleicher, and Frédéric Pighin. Motion graphs. In ACM Transactions on Graphics 21(3), Proc. SIGGRAPH, pages 473–482, July 2002.
 [49] Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J Black, and Peter V Gehler. Unite the people: Closing the loop between 3d and 2d human representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6050–6059, 2017.
 [50] Yan Li, Tianshu Wang, and Heung-Yeung Shum. Motion texture: A two-level statistical model for character motion synthesis. In ACM Transactions on Graphics 21(3), Proc. SIGGRAPH, pages 465–472, July 2002.
 [51] Zongmian Li, Jiri Sedlar, Justin Carpentier, Ivan Laptev, Nicolas Mansard, and Josef Sivic. Estimating 3d motion and forces of person-object interactions from monocular video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
 [52] Hung Yu Ling, Fabio Zinno, George Cheng, and Michiel van de Panne. Character controllers using motion vaes. In ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH), volume 39. ACM, 2020.
 [53] Chen Liu, Kihwan Kim, Jinwei Gu, Yasutaka Furukawa, and Jan Kautz. Planercnn: 3d plane detection and reconstruction from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4450–4459, 2019.
 [54] C. Karen Liu, Aaron Hertzmann, and Zoran Popović. Learning physics-based motion style with nonlinear inverse optimization. ACM Trans. Graph, 2005.
 [55] Miao Liu, Dexin Yang, Yan Zhang, Zhaopeng Cui, James M Rehg, and Siyu Tang. 4d human body capture from egocentric video via 3d scene grounding. arXiv preprint arXiv:2011.13341, 2020.
 [56] Matthew Loper, Naureen Mahmood, and Michael J. Black. MoSh: Motion and Shape Capture from Sparse Markers. ACM Trans. Graph., 33(6), Nov. 2014.
 [57] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.
 [58] Eyes JAPAN Co. Ltd. Eyes Japan MoCap Dataset.
 [59] James Lucas, George Tucker, Roger B Grosse, and Mohammad Norouzi. Don't blame the ELBO! A linear VAE perspective on posterior collapse. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
 [60] Zhengyi Luo, S Alireza Golestaneh, and Kris M Kitani. 3d human motion estimation via motion compression and refinement. In Proceedings of the Asian Conference on Computer Vision, 2020.
 [61] Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard PonsMoll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. In International Conference on Computer Vision, 2019.
 [62] C. Mandery, Ö. Terlemez, M. Do, N. Vahrenkamp, and T. Asfour. The KIT whole-body human motion database. In 2015 International Conference on Advanced Robotics (ICAR), pages 329–336, July 2015.
 [63] Dushyant Mehta, Oleksandr Sotnychenko, Franziska Mueller, Weipeng Xu, Mohamed Elgharib, Pascal Fua, Hans-Peter Seidel, Helge Rhodin, Gerard Pons-Moll, and Christian Theobalt. Xnect: Real-time multi-person 3d motion capture with a single rgb camera. ACM Transactions on Graphics (TOG), 39(4):82–1, 2020.
 [64] Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. Vnect: Real-time 3d human pose estimation with a single rgb camera. ACM Transactions on Graphics (TOG), 36(4):1–14, 2017.
 [65] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 [66] Aron Monszpart, Paul Guerrero, Duygu Ceylan, Ersin Yumer, and Niloy J. Mitra. iMapper: Interaction-guided scene mapping from monocular videos. ACM SIGGRAPH, 2019.
 [67] M. Müller, T. Röder, M. Clausen, B. Eberhardt, B. Krüger, and A. Weber. Documentation mocap database HDM05. Technical Report CG20072, Universität Bonn, 2007.
 [68] Dirk Ormoneit, Hedvig Sidenbladh, Michael J. Black, and Trevor Hastie. Learning and tracking cyclic human motion. In Advances in Neural Information Processing Systems 13, pages 894–900, 2001.
 [69] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
 [70] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, 2019.
 [71] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3d human pose and shape from a single color image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 459–468, 2018.
 [72] Dario Pavllo, Christoph Feichtenhofer, Michael Auli, and David Grangier. Modeling human motion with quaternion-based neural networks. International Journal of Computer Vision, pages 1–18, 2019.
 [73] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7753–7762, 2019.
 [74] Vladimir Pavlović, James M. Rehg, and John MacCormick. Learning switching linear models of human motion. In Advances in Neural Information Processing Systems 13, pages 981–987, 2001.

 [75] Xue Bin Peng, Angjoo Kanazawa, Jitendra Malik, Pieter Abbeel, and Sergey Levine. Sfv: Reinforcement learning of physical skills from videos. ACM Trans. Graph, 2018.
 [76] Davis Rempe, Leonidas J. Guibas, Aaron Hertzmann, Bryan Russell, Ruben Villegas, and Jimei Yang. Contact and human dynamics from monocular video. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
 [77] Chris Rockwell and David F Fouhey. Full-body awareness from partial observations. arXiv preprint arXiv:2008.06046, 2020.
 [78] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6), Nov. 2017.

 [79] Charles Rose, Michael F. Cohen, and Bobby Bodenheimer. Verbs and adverbs: Multidimensional motion interpolation. IEEE Computer Graphics and Applications, 18(5):32–40, 1998.
 [80] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
 [81] Mingyi Shi, Kfir Aberman, Andreas Aristidou, Taku Komura, Dani Lischinski, Daniel Cohen-Or, and Baoquan Chen. Motionet: 3d human motion reconstruction from monocular video with skeleton consistency. ACM Transactions on Graphics (TOG), 40(1):1–15, 2020.
 [82] Soshi Shimada, Vladislav Golyanik, Weipeng Xu, and Christian Theobalt. Physcap: Physically plausible monocular 3d motion capture in real time. ACM Trans. Graph., 39(6), Nov. 2020.
 [83] Hedvig Sidenbladh, Michael J. Black, and David J. Fleet. Stochastic tracking of 3D human figures using 2D image motion. In ECCV, pages 702–718, 2000. Part II.
 [84] Leonid Sigal, Alexandru O Balan, and Michael J Black. HumanEva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International journal of computer vision, 87(1–2):4, 2010.
 [85] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28, pages 3483–3491. Curran Associates, Inc., 2015.
 [86] Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. Neural state machine for character-scene interactions. ACM Trans. Graph., 38(6):209–1, 2019.
 [87] Yu Sun, Yun Ye, Wu Liu, Wenpeng Gao, Yili Fu, and Tao Mei. Human mesh recovery from monocular images via a skeleton-disentangled representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5349–5358, 2019.
 [88] Graham W. Taylor, Geoffrey E. Hinton, and Sam T. Roweis. Modeling human motion using binary latent variables. In Proc. NIPS, 2007.
 [89] Nikolaus F. Troje. Decomposing biological motion: A framework for analysis and synthesis of human gait patterns. Journal of Vision, 2(5):2–2, Sept. 2002.
 [90] Matthew Trumble, Andrew Gilbert, Charles Malleson, Adrian Hilton, and John Collomosse. Total capture: 3d human pose estimation fusing video and inertial sensors. In Proceedings of the British Machine Vision Conference (BMVC), Sept. 2017.
 [91] Simon Fraser University and National University of Singapore. SFU Motion Capture Database.
 [92] Raquel Urtasun, David J. Fleet, and Pascal Fua. 3D people tracking with Gaussian process dynamical models. In IEEE Conf. Comp. Vis. & Pattern Rec., pages 238–245, 2006. Vol. 1.
 [93] Raquel Urtasun, David J. Fleet, and Pascal Fua. Temporal motion models for monocular and multiview 3D human body tracking. CVIU, 104(2):157–177, 2006.
 [94] Gül Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. BodyNet: Volumetric inference of 3D human body shapes. In ECCV, 2018.
 [95] Gül Varol, Ivan Laptev, Cordelia Schmid, and Andrew Zisserman. Synthetic humans for action recognition from unseen viewpoints. arXiv preprint arXiv:1912.04070, 2019.
 [96] Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 109–117, 2017.
 [97] Jack M. Wang. Gaussian process dynamical models for human motion. Master’s thesis, University of Toronto, 2005.
 [98] Jianqiao Wangni, Dahua Lin, Ji Liu, Kostas Daniilidis, and Jianbo Shi. Towards statistically provable geometric 3d human pose recovery. SIAM Journal on Imaging Sciences, 14(1):246–270, 2021.
 [99] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
 [100] Donglai Xiang, Hanbyul Joo, and Yaser Sheikh. Monocular total capture: Posing face, body, and hands in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10974, 2019.
 [101] Dongseok Yang, Doyeon Kim, and Sung-Hee Lee. Real-time lower-body pose prediction from sparse upper-body tracking signals. arXiv preprint arXiv:2103.01500, 2021.
 [102] Ye Yuan and Kris Kitani. Dlow: Diversifying latent flows for diverse human motion prediction. In European Conference on Computer Vision, pages 346–364. Springer, 2020.
 [103] Andrei Zanfir, Eduard Gabriel Bazavan, Hongyi Xu, William T Freeman, Rahul Sukthankar, and Cristian Sminchisescu. Weakly supervised 3d human pose and shape reconstruction with normalizing flows. In European Conference on Computer Vision, pages 465–481. Springer, 2020.
 [104] Andrei Zanfir, Elisabeta Marinoiu, and Cristian Sminchisescu. Monocular 3d pose and shape estimation of multiple people in natural scenes – the importance of multiple scene constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2148–2157, 2018.
 [105] Jason Y Zhang, Panna Felsen, Angjoo Kanazawa, and Jitendra Malik. Predicting 3d human dynamics from video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7114–7123, 2019.
 [106] Tianshu Zhang, Buzhen Huang, and Yangang Wang. Object-occluded human shape and pose estimation from a single color image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
 [107] Yan Zhang, Michael J. Black, and Siyu Tang. We are more than our joints: Predicting how 3d bodies move, 2020.
Appendices
Here we provide details and extended evaluations omitted from the main paper for brevity. (In the rest of this document, we refer to the main paper simply as the paper.) App. A provides extended discussions, App. B and App. C give method details regarding the HuMoR model and test-time optimization (TestOpt), App. D derives our optimization energy from a probabilistic perspective, App. E provides experimental details from the main paper, and App. F contains extended experimental evaluations.
We encourage the reader to view the supplementary videos on the project webpage for extensive qualitative results. We further discuss these results in App. F.
Appendix A Discussions
State Representation
Our state representation is somewhat redundant in that it includes both explicit joint positions and SMPL parameters (which also determine joint positions through the body model). This is motivated by recent works [52, 107] which show that using an extrinsic representation of body keypoints (joint positions or mesh vertices) helps in learning motion characteristics like static contact, thereby improving the visual quality of generated motions. The over-parameterization, unique to our approach, additionally allows for consistency losses leveraged during CVAE training and in TestOpt.
Another noteworthy property of our state is that it does not explicitly represent full-body shape – only bone proportions are implicitly encoded through joint locations. During training, we use the shape parameters provided in AMASS [61] to compute joint positions, but otherwise the CVAE is shape-unaware. Extending our formulation to include full-body shape is an important direction for improved generalization and should be considered in future work.
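To make the over-parameterization concrete, the following sketch pairs explicit joints with SMPL-style parameters and measures their disagreement, the kind of consistency term the redundancy enables. The joint regressor and all names here are illustrative stand-ins, not the actual SMPL model.

```python
import numpy as np

NUM_JOINTS = 22  # SMPL+H body joints used by HuMoR, including the root

def fake_smpl_joints(pose, trans):
    """Stand-in for the SMPL joint regressor (a real implementation would run
    linear blend skinning); here joints are a fixed offset pattern plus the
    root translation, and the pose argument is ignored."""
    offsets = np.linspace(0.0, 1.0, NUM_JOINTS * 3).reshape(NUM_JOINTS, 3)
    return offsets + trans

def consistency_loss(state):
    """Penalize disagreement between the explicit joints stored in the state
    and the joints implied by its SMPL parameters."""
    implied = fake_smpl_joints(state["pose"], state["trans"])
    return float(np.mean((state["joints"] - implied) ** 2))

trans = np.array([0.0, 0.0, 0.9])
state = {
    "pose": np.zeros(NUM_JOINTS * 3),         # SMPL pose parameters
    "trans": trans,                            # root translation
    "joints": fake_smpl_joints(None, trans),   # explicit joint positions
}
# A self-consistent state incurs zero consistency loss.
assert consistency_loss(state) == 0.0
```

During TestOpt, a term of this form ties the redundant parts of the state together so that optimizing one representation cannot drift away from the other.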
Conditioning on More Time Steps
Alternatively, we could condition the dynamics learned by the CVAE on additional previous steps; however, since the state already includes velocities, this is unnecessary and only increases the chance of overfitting to training motions. It would additionally increase the necessary computation for both generation and TestOpt.
Why CVAE? Our use of a CVAE to model motion is primarily motivated by recent promising results in the graphics community [52, 28]. Not only is it a simple solution, but it also affords the physical interpretation presented in the main paper. Other deep generative models could be considered, but each has potential issues compared to our CVAE. A conditional generative adversarial network [65] would use standard normal noise for the latent transition, which we show is insufficient in multiple experiments; furthermore, it does not allow for inferring a latent transition from observed states. Past works have had success with recurrent and variational-recurrent architectures [107]. As discussed previously, the reliance of these networks on multiple timesteps increases overfitting, which is especially dangerous for our estimation application, since it must be able to represent arbitrary observed motions. Finally, normalizing flows [45] and neural ODEs [18] show exciting potential for modeling human motion, but conditional generation with these models is not yet well-developed.
A Note on β-VAE [36]
The KL weight in Eq. 7 of the main paper is not directly comparable to that of a typical β-VAE [36] due to various implementation details. First, the reconstruction term is the mean-squared error (MSE) of the unnormalized state rather than the true log-likelihood. The use of additional regularizers that are not formulated probabilistically as part of the reconstruction loss further compounds the difference. Furthermore, in practice, losses are averaged over both the feature and batch dimensions so as not to depend on the chosen dimensionalities. All these differences affect the effective setting of the KL weight.
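To see why dimension-averaging changes the effective KL weight, consider the following toy comparison (the dimensions, weight value, and per-dimension errors are all illustrative, not the paper's): averaging each term over its own feature dimension is equivalent, up to an overall scale, to a summed objective with a rescaled β.

```python
import numpy as np

state_dim, latent_dim = 48, 12
rng = np.random.default_rng(0)
sq_err = rng.random(state_dim)   # per-dimension squared reconstruction error
kl = rng.random(latent_dim)      # per-dimension KL contribution

w_kl = 1e-4  # placeholder weight
# Dimension-averaged objective, as used in practice.
loss_mean = sq_err.mean() + w_kl * kl.mean()
# Dimension-summed objective with a rescaled KL weight.
loss_sum = sq_err.sum() + w_kl * (state_dim / latent_dim) * kl.sum()

# Up to the overall scale factor `state_dim` the two objectives agree, so the
# averaged weight corresponds to a different "standard" beta than its raw value.
assert np.isclose(loss_mean * state_dim, loss_sum)
```

This is why the raw KL weight value reported for a dimension-averaged loss cannot be read as a β-VAE β without accounting for the ratio of state to latent dimensionality.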
The Need for Regularization in Optimization
The motion prior term, which leverages our learned conditional prior and GMM, falls naturally out of the MAP derivation (see App. D below) and is by itself a reasonable way to ensure that motion is plausible. However, in practice it can be prone to local minima and slow to converge without any regularization. This is primarily because HuMoR is trained on clean motion capture data from AMASS [61], but in the early stages of optimization the initial state will be far from this domain. This means rolled-out motions using the CVAE decoder will be implausible, and the likelihood output from the learned conditional prior is not necessarily meaningful (since inputs will be well outside the training distribution). The additional regularizers presented in the main paper allow us to resolve this issue by reflecting the expected behavior of the motion model when it is producing truly plausible motions (i.e., motion similar to the training data).
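The rollout behavior referred to above can be sketched as a generic autoregressive loop: sample a latent transition from a conditional prior, decode it into the next state, and repeat. The prior and decoder below are made-up linear stand-ins, not HuMoR's networks; the point is only the control flow that TestOpt evaluates.

```python
import numpy as np

rng = np.random.default_rng(1)
state_dim, latent_dim, steps = 6, 3, 10

def prior(x_prev):
    """Stand-in conditional prior p(z | x_prev): returns (mean, std)."""
    mu = 0.1 * x_prev[:latent_dim]
    return mu, np.full(latent_dim, 0.05)

def decoder(z, x_prev):
    """Stand-in decoder: maps a latent transition to a state change,
    which is then integrated onto the previous state."""
    delta = np.zeros(state_dim)
    delta[:latent_dim] = z
    return x_prev + delta

x0 = np.zeros(state_dim)  # initial state; off-distribution x0 degrades rollouts
traj = [x0]
for _ in range(steps):
    mu, std = prior(traj[-1])
    z = mu + std * rng.standard_normal(latent_dim)  # sample latent transition
    traj.append(decoder(z, traj[-1]))

assert len(traj) == steps + 1
```

Because each step conditions on the previous output, an implausible initial state contaminates every subsequent prior evaluation, which is exactly why extra regularization is needed early in optimization.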
On Evaluation Metrics
As discussed in prior work [76], traditional positional metrics used to evaluate root-relative pose estimates do not capture the accuracy of the absolute (“global”) motion nor its physical/perceptual plausibility. This is why we use a range of metrics to capture global joint accuracy, local joint accuracy (after aligning root joints), and the plausibility of a motion. However, these metrics still have flaws, and there is a need to develop more informative motion estimation evaluation metrics for both absolute accuracy and plausibility. This is especially true in scenarios of severe occlusion, where there is not a single correct answer: even if the “ground truth” 3D joints are available, there may be multiple motions that explain the partial observations equally well.
On Convergence
Our multi-objective optimization uses a mixture of convex and non-convex loss functions. As we utilize L-BFGS, the minimum-energy solution we report is only locally optimal. While simulated annealing or MCMC/HMC (Markov Chain Monte Carlo / Hamiltonian Monte Carlo) exploration approaches could be deployed to search for the global optimum, such methods would incur a heavy computational load and are hence prohibitive in our setting. Thanks to accurate initialization, we found that TestOpt converges to a good minimum most of the time. This observation is also lightly supported by recent work arguing that statistically provable convergence can be attained for the human pose problem under convex and non-convex regularization using a multi-stage optimization scheme [98].
A.1 Assumptions and Limitations
On the Assumption of a Ground Plane
We use the ground during TestOpt to obtain a transformation to the canonical reference frame in which our prior is trained. While this is a reasonable assumption in a majority of scenarios, we acknowledge that certain applications might require in-the-wild operation where a single ground plane does not exist, e.g., climbing up stairs or moving over complex terrain. In such scenarios, we require a consistent reference frame, which can be computed from: (i) an accelerometer if a mobile device is used, (ii) the pose of static, rigid objects if an object detector is deployed, or (iii) fiducial tags or any other means of obtaining a gravity direction.
Note that the ground plane is not an essential piece of the test-time optimization. It is a requirement only because of the way our CVAE is trained: on motions with a fixed ground plane, gravity aligned with the vertical direction, and without complex terrain interactions. Although we empirically noticed that convergence of training necessitates this assumption, other architectures or the availability of larger in-the-wild motion datasets might make training HuMoR possible under arbitrary poses. This perspective should clarify why our method can work when the ground is invisible: TestOpt might converge from a bad initialization as long as our prior (HuMoR) is able to account for the observation.
On the Assumption of a Static Camera
While a static camera is assumed in all of our evaluations, recent advances in 3D computer vision make it possible to overcome this limitation. Our method, backed by either a structure-from-motion / SLAM pipeline or a camera relocalization engine, can indeed work in scenarios where the camera moves as well as the human targets. A more sophisticated solution could leverage our learned motion model to disambiguate between camera and human motion. As expected, this requires further investigation, making room for future studies as discussed at the end of the main paper.
Other Limitations and Failure Cases
As discussed in Sec. 6 of the main paper, HuMoR has limitations that motivate multiple future directions. First, optimization is generally slow compared to learning-based (direct prediction) methods, which also affects our test-time optimization; approaches for learning to optimize could come in handy for increasing the efficiency of our method. Additionally, our current formulation of TestOpt allows only for a single output, the local optimum. Therefore, future work may explore learned approaches yielding multi-hypothesis output, which can be used to characterize uncertainty.
Specific failure cases (as shown in the supplementary videos and Fig. 6) further highlight areas for future improvement. First, extreme occlusions (e.g., only a few visible points as in Fig. 6, left), especially at the first frame, which determines the initial state, make for a difficult optimization that often lands in local minima with implausible motions. Second, uncommon motions that are rare during CVAE training, such as laying down in Fig. 6 (middle), can cause spurious ground plane outputs as TestOpt attempts to make the motion more likely. Leveraging more holistic scene understanding methods and models of human-environment interaction will help in these cases. Finally, our method depends on motion to resolve ambiguity, which is usually very helpful but has corner cases as shown in Fig. 6 (right). For example, if the observed person is nearly static, the optimization may produce implausible poses due to ambiguous occlusions (e.g., standing when the person is really sitting) and/or incorrect ground plane estimates.
Appendix B HuMoR Model Details
In this section, we provide additional implementation details for the HuMoR motion model described in Sec. 3 of the main paper.
B.1 CVAE Architecture and Implementation
Body Model
We use the SMPL+H body model [78] since it is used by the AMASS [61] dataset. However, our focus is on modeling body motion, so HuMoR and TestOpt do not consider the hand joints (leaving the 22 body joints, including the root). Hand joints could be straightforwardly optimized along with body motion, but this was not in our current scope.
Canonical Coordinate Frame
To ease learning and improve generalization, our network operates on inputs in a canonical coordinate frame. Specifically, based on the previous state, we apply a rotation around the up axis and a translation in the ground plane such that the horizontal components of the root are zero and the person faces a canonical heading direction.
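A minimal sketch of such a canonicalization, assuming the up axis is the third coordinate and the heading is a yaw angle about it (both are assumptions here, since the exact axis conventions are elided in the text):

```python
import numpy as np

def canonicalize(root_pos, heading):
    """Return a rotation R (about the up axis) and translation t such that the
    transformed root has zero horizontal components and zero heading; height
    is left unchanged."""
    c, s = np.cos(-heading), np.sin(-heading)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])  # yaw rotation about the up axis
    new_root = R @ root_pos
    t = np.array([-new_root[0], -new_root[1], 0.0])  # zero out horizontal pos
    return R, t

root = np.array([2.0, -1.0, 0.9])
R, t = canonicalize(root, heading=0.7)
canon_root = R @ root + t
# Horizontal components vanish; height is preserved by a yaw rotation.
assert np.allclose(canon_root[:2], 0.0) and np.isclose(canon_root[2], 0.9)
```

Applying the same `(R, t)` to every joint and velocity in the state would place the whole input in the canonical frame before it is fed to the network.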
Architecture
The encoder and prior networks are identical multi-layer perceptrons (MLPs) with 5 layers and hidden size 1024. The decoder is a 4-layer MLP with hidden sizes (1024, 1024, 512). The latent transition is skip-connected to every layer of the decoder in order to emphasize its importance and help avoid posterior collapse [52]. ReLU non-linearities and group normalization [99] with 16 groups are used between all layers except the outputs of each network. Input rotations are represented as rotation matrices, while the network outputs an axis-angle representation. In total, the CVAE network contains 9.7 million parameters.
B.2 CVAE Training
Losses
The loss function used for training is primarily described in the main paper (see Eq. 7). For a training pair $(x_{t-1}, x_t)$, the KL divergence loss term is computed between the output distributions of the encoder and conditional prior as

$$\mathcal{L}_{KL} = \mathrm{KL}\left( q_\phi(z_t \mid x_t, x_{t-1}) \,\|\, p_\theta(z_t \mid x_{t-1}) \right). \quad (15)$$
The SMPL loss is computed using the ground truth shape parameters provided in AMASS on the ground truth gendered body model.
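When both the encoder posterior and the conditional prior are diagonal Gaussians, the KL term in Eq. 15 has a standard closed form, sketched below (the Gaussian parameterization is an assumption of this sketch, matching the usual CVAE setup):

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal covariances,
    summed over latent dimensions."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    per_dim = 0.5 * (logvar_p - logvar_q
                     + (var_q + (mu_q - mu_p) ** 2) / var_p
                     - 1.0)
    return per_dim.sum()

mu = np.array([0.3, -0.2])
logvar = np.array([0.1, -0.4])
# KL between a distribution and itself is zero, and KL is non-negative.
assert np.isclose(kl_diag_gaussians(mu, logvar, mu, logvar), 0.0)
assert kl_diag_gaussians(mu, logvar, np.zeros(2), np.zeros(2)) >= 0.0
```

In a framework with automatic differentiation, the same expression written on network outputs gives gradients to both the encoder and the conditional prior.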
Dataset
For training, we use AMASS [61]: a large, publicly available motion capture (mocap) database containing over 11k motion sequences from 344 different people fit to SMPL. The database aggregates and standardizes many mocap datasets into one. We preprocess AMASS by cropping the middle 80% of each motion sequence, subsampling to 30 Hz, estimating velocities with finite differences, and using automated heuristics based on foot contacts to remove sequences with substantial terrain interaction (stairs, ramps, or platforms). We automatically annotate ground contacts for 8 body joints (left and right toes, heels, knees, and hands) based on velocity and height. In particular, if a joint has moved less than a small distance threshold in the last timestep and its height is within a small threshold of the floor, it is considered to be in contact. For toe joints, we use a tighter height threshold.
For training the CVAE, we use the recommended training split (save for TCD Hands [39], which contains mostly hand motions): CMU [16], MPI Limits [2], TotalCapture [90], Eyes Japan [58], KIT [62], BMLrub [89], BMLmovi [27], EKUT [62], and ACCAD [1]. For validation during training, we use MPI HDM05 [67], SFU [91], and MPI MoSh [56]. Finally, for evaluations (Sec. 5.3 of the main paper), we use HumanEva [84] and Transitions [61].
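The velocity-and-height contact heuristic can be sketched as below. The concrete thresholds are placeholders (the exact values are elided in this text), but the logic follows the description: a joint is labeled in contact if it barely moved since the last timestep and sits close to the floor.

```python
import numpy as np

VEL_THRESH = 0.005    # meters moved per timestep (placeholder value)
HEIGHT_THRESH = 0.05  # meters above the floor (placeholder value)

def annotate_contacts(joints_prev, joints_cur, floor_height=0.0,
                      vel_thresh=VEL_THRESH, height_thresh=HEIGHT_THRESH):
    """Return a boolean contact label per joint; height is assumed to be the
    last coordinate of each joint position."""
    moved = np.linalg.norm(joints_cur - joints_prev, axis=-1)
    height = joints_cur[..., -1] - floor_height
    return (moved < vel_thresh) & (np.abs(height) < height_thresh)

# A near-static heel on the floor vs. a fast-moving hand well above it.
prev = np.array([[0.0, 0.0, 0.01], [0.0, 0.0, 0.5]])
cur = np.array([[0.001, 0.0, 0.01], [0.1, 0.0, 0.5]])
contacts = annotate_contacts(prev, cur)
assert contacts.tolist() == [True, False]
```

The tighter height threshold mentioned for toe joints would simply pass a smaller `height_thresh` for those joint indices.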
Training Procedure
We train using 10-step sequences sampled on-the-fly from the training set (in order to use scheduled sampling as detailed below). To acquire a training sequence, a full mocap sequence is randomly (uniformly) chosen from AMASS and then a random 10-step window within that sequence is (uniformly) sampled. Training is performed using batches of 2000 sequences for 200 epochs with Adamax [43]; we found this to be more stable than Adam. The learning rate is decayed at epochs 50, 80, and 140. We linearly anneal the weight of the KL divergence loss to its full value over the first 50 epochs to avoid posterior collapse. We use early stopping by choosing the network parameters that result in the best validation-split performance throughout training.
Training Computational Requirements
We train our CVAE on a single Tesla V100 16GB GPU, which takes approximately 4 days.
Scheduled Sampling
As explained in the main paper, our scheduled sampling follows [52]. In particular, at each training epoch we define a probability of using the ground-truth state input at each timestep in a training sequence, as opposed to the model’s own previous output. Training uses a curriculum over this probability: first always using ground-truth inputs (regular supervised training), then a mix of true and self inputs at each step, and finally always using the model’s own outputs (full generated rollouts). Importantly for training stability, if the model’s own prediction is used as input at a given step, we do not backpropagate gradients from the loss at that step back through the previous step’s output.
For CVAE training, we use 10 epochs of regular supervised training, 10 of mixed true and self inputs, and the rest using full selfrollouts.
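The per-step input selection under scheduled sampling can be sketched as below; `step_fn` stands in for one CVAE rollout step, and the function name and structure are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_inputs(gt_states, step_fn, p_true):
    """Choose per-step inputs for a training sequence under scheduled
    sampling: with probability p_true feed the ground-truth previous
    state, otherwise the model's own previous output.
    """
    inputs, prev_out = [], None
    for t, gt_prev in enumerate(gt_states[:-1]):
        use_gt = t == 0 or prev_out is None or rng.random() < p_true
        x_in = gt_prev if use_gt else prev_out
        # In PyTorch, prev_out would be detach()-ed here so the loss at
        # this step does not backpropagate through earlier steps.
        prev_out = step_fn(x_in)
        inputs.append(x_in)
    return inputs
```

With `p_true = 1` this reduces to regular supervised training; with `p_true = 0` it becomes a full self-rollout, matching the curriculum stages above.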
B.3 Initial State GMM
State Representation
Since the GMM models a single state, we use a modified representation that is minimal (avoids redundancies) in order to be useful during test-time optimization. In particular, the GMM state is
\mathbf{x}_0 = \left[ \dot{\mathbf{r}},\ \dot{\boldsymbol{\Phi}},\ \mathbf{J},\ \dot{\mathbf{J}} \right] \quad (16)
with $\dot{\mathbf{r}}$ and $\dot{\boldsymbol{\Phi}}$ the root linear and angular velocities, and $\mathbf{J}$, $\dot{\mathbf{J}}$ the joint positions and velocities. During TestOpt, joints are determined from the current SMPL parameters so that gradients of the GMM log-likelihood (Eq. 11 in the main paper) will be used to update the initial state SMPL parameters.
Implementation Details
The GMM uses full covariance matrices for each of the 12 components and operates in the same canonical coordinate frame as the CVAE. It is trained with expectation maximization (using scikit-learn) on every state in the same AMASS training set used for the CVAE.
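Since the text names scikit-learn, fitting the initial-state GMM might look like the following sketch (the function name and everything beyond the 12 full-covariance components are assumptions):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_initial_state_gmm(states, n_components=12, seed=0):
    """Fit the initial-state GMM with EM: full-covariance components
    over all training states. `states` is (N, D), with D the dimension
    of the state in Eq. 16."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full",
                          random_state=seed)
    gmm.fit(states)
    return gmm
```

During TestOpt, `gmm.score_samples` would then supply the per-state log-likelihood used as the initial-state prior.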
Appendix C Test-Time Optimization Details
In this section, we give additional details of the motion and shape optimization described in Sec. 4 of the main paper.
State Representation
In practice, for optimizing we slightly modify the state from Eq. 1 in the main paper. First, we remove the joint positions to avoid the previously discussed redundancy, which is helpful for training the CVAE but harmful for test-time optimization. Instead, we compute joint positions from the SMPL model whenever needed. Second, we represent body pose in the latent space of the VPoser [70] pose prior. Whenever needed, we can map between full joint angles and latent pose using the VPoser encoder and decoder. Finally, state variables are represented in the coordinate frame of the given observations, relative to the camera, to allow fitting to the data.
Floor Parameterization
As detailed in the main paper, to obtain the transformation between the canonical coordinate frame in which our CVAE is trained and the observation frame used for optimization, we additionally optimize the floor plane of the scene. The plane is parameterized by its unit normal vector and offset. To disambiguate the direction of the normal vector, we assume that its vertical component in the camera frame must be negative, i.e., the normal points upward in the camera coordinate frame. This assumes the camera is not so severely tilted that the observed scene is “upside down”.
ObservationtoCanonical Transformation
We assume that gravity is orthogonal to the ground plane. Therefore, given the current floor and root state, we compute a rotation and translation to the canonical CVAE frame: after the transformation, the ground normal is aligned with the canonical up axis, the person faces the canonical forward direction, and the horizontal components of the root translation are 0. With this transformation, we can always compute the state at time $t$ from the initial state, the latent sequence, and the floor by (i) transforming to the canonical frame, (ii) using the CVAE to roll out the motion, and (iii) transforming back to the observation frame.
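The normal-alignment part of this transformation can be sketched with Rodrigues' formula. Taking +z as the canonical up axis and omitting heading alignment are simplifying assumptions here:

```python
import numpy as np

def rotation_between(a, b):
    """Rotation matrix taking unit vector a to unit vector b
    (Rodrigues' formula; assumes a != -b)."""
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    vx = np.array([[0.0, -v[2], v[1]],
                   [v[2], 0.0, -v[0]],
                   [-v[1], v[0], 0.0]])
    return np.eye(3) + vx + vx @ vx / (1.0 + c)

def to_canonical(points, floor_normal, root_pos):
    """Rotate observation-frame points so the floor normal aligns with
    an assumed canonical up axis (+z), then shift so the root's
    horizontal components become 0. Heading alignment is omitted."""
    up = np.array([0.0, 0.0, 1.0])
    n = floor_normal / np.linalg.norm(floor_normal)
    R = rotation_between(n, up)
    p = points @ R.T   # rotate all points
    r = R @ root_pos   # rotated root position
    p[..., :2] -= r[:2]  # zero the root's horizontal translation
    return p
```

The inverse transformation (canonical back to observation frame) applies the same rotation and translation in reverse after CVAE rollout.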
Optimization Objective Details
The optimization objective is detailed in Sec. 4.2 of the main paper. To compensate for the heavy-tailed behavior of real data, we use robust losses for multiple data terms. One data term uses the Geman-McClure function [25], which for our purposes is defined as $\rho(r) = r^2 / (r^2 + \sigma^2)$ for a residual $r$ and scaling factor $\sigma$; we use a fixed $\sigma$ for all experiments. The term based on the one-way chamfer distance (see Eq. 14 in the main paper) uses robust bisquare weights [7]: residuals over the whole sequence are first normalized using a robust estimate of the standard deviation based on the median absolute deviation (MAD), then each weight is computed as
w(\hat{r}) = \begin{cases} \left(1 - (\hat{r}/C)^2\right)^2 & \text{if } |\hat{r}| \le C \\ 0 & \text{otherwise} \end{cases} \quad (17)
In this equation, $\hat{r}$ is a normalized residual and $C$ is a tuning constant which we set to 4.6851.
Initialization
As detailed in Sec. 4.2 of the main paper, our optimization is initialized by directly optimizing SMPL pose and shape parameters using the data terms along with a pose prior and a joint smoothing term, weighted as listed in Tab. 5. This two-stage initialization first optimizes global translation and orientation for 30 optimization steps, followed by full pose and shape for 80 steps. At termination, we estimate velocities using finite differences, which allows direct initialization of the state. To initialize the latent sequence, the CVAE encoder is used to infer the latent transition between every pair of frames. The initial shape parameters are a direct output of the initialization optimization. Finally, for fitting to RGB(D), the ground plane is initialized from video with PlaneRCNN [53].
Optimization (TestOpt) Details
Our optimization is implemented in PyTorch [69] using L-BFGS with a step size of 1.0 and autograd. For all experiments, we optimize using the neutral SMPL+H [78] body model in 3 stages. First, only the initial state and the first 15 frames of the latent sequence are optimized for 30 iterations in order to quickly reach a reasonable initial state. Next, the initial state is fixed while the full latent dynamics sequence is optimized for 25 iterations; finally, the full sequence and initial state are tuned together for another 15 iterations. The ground and shape are optimized in every stage.
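A single stage in this style can be sketched in PyTorch. The closure-based call is the standard pattern for `torch.optim.LBFGS`; the helper function itself is illustrative, not the paper's code:

```python
import torch

def optimize_stage(params, loss_fn, iters):
    """One TestOpt-style stage: L-BFGS (step size 1.0) over the given
    parameter tensors. A minimal sketch -- the real optimizer runs
    three such stages over the initial state and latent sequence."""
    opt = torch.optim.LBFGS(params, lr=1.0, max_iter=iters)

    def closure():
        # L-BFGS re-evaluates the objective multiple times per step,
        # so the loss and gradients are computed inside a closure.
        opt.zero_grad()
        loss = loss_fn()
        loss.backward()
        return loss

    opt.step(closure)
    return float(loss_fn())
```

The three stages would then call this helper with different subsets of parameters (e.g., freezing the initial state in the second stage by excluding it from `params`).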
The energy weights used for each experiment in the main paper are detailed in Tab. 5. The left part of the table indicates weights for the initialization phase (the VPoser-t baseline), while the right part is for our full proposed optimization. A dash indicates that the energy is not relevant for that data modality and therefore not used. Weights were manually tuned using the presented evaluation metrics and qualitative assessment. Note that for similar modalities (3D joints and keypoints, or RGB and RGBD), the weights are quite similar, so only slight tuning should be necessary to transfer to new data. The main trade-off is between reconstruction accuracy and motion plausibility: the motion prior is weighted higher for i3DB, which contains many severe occlusions, than for PROX RGB, where the person is often nearly fully visible.
Dataset | Initialization | Full Optimization
AMASS (occ keypoints) | 1.0, -, -, 0.015, 0.1 | 1.0, -, -, 0.015, 1.0, 10.0, 1.0, 1.0, -
AMASS (noisy joints) | 1.0, -, -, 0.015, 10.0 | 1.0, -, -, 0.015, 1.0, 10.0, 1.0, 1.0, -
i3DB (RGB) | -, -, 4.5, 0.04, 100.0 | -, -, 4.5, 0.075, 0.075, 100.0, 0.0, 10.0, 15.0
PROX (RGB) | -, -, 4.5, 0.04, 100.0 | -, -, 4.5, 0.05, 0.05, 100.0, 100.0, 10.0, 15.0
PROX (RGBD) | -, 1.0, 3.0, 0.1, 100.0 | -, 1.0, 3.0, 0.075, 0.075, 100.0, 100.0, 10.0, 90.0
Appendix D MAP Objective Derivation
In this section, we formulate the core of the pose and shape optimization objective (Eq. 10 in the main paper) from a probabilistic perspective. Recall that we want to optimize the initial state $\mathbf{x}_0$, a sequence of latent variables $\mathbf{z}_{1:T}$, ground $\mathbf{g}$, and shape $\boldsymbol{\beta}$ based on a sequence of observations $\mathbf{y}_{0:T}$. We are interested in the maximum a posteriori (MAP) estimate:
\arg\max_{\mathbf{x}_0, \mathbf{z}_{1:T}, \mathbf{g}, \boldsymbol{\beta}} \; p(\mathbf{x}_0, \mathbf{z}_{1:T}, \mathbf{g}, \boldsymbol{\beta} \mid \mathbf{y}_{0:T}) \quad (18)
= \arg\max_{\mathbf{x}_0, \mathbf{z}_{1:T}, \mathbf{g}, \boldsymbol{\beta}} \; p(\mathbf{y}_{0:T} \mid \mathbf{x}_0, \mathbf{z}_{1:T}, \mathbf{g}, \boldsymbol{\beta}) \, p(\mathbf{x}_0, \mathbf{z}_{1:T}, \mathbf{g}, \boldsymbol{\beta}) \quad (19)
Assuming each observation $\mathbf{y}_t$ is conditionally independent of the others given the corresponding state, the left term is written
p(\mathbf{y}_{0:T} \mid \mathbf{x}_0, \mathbf{z}_{1:T}, \mathbf{g}, \boldsymbol{\beta}) = \prod_{t=0}^{T} p(\mathbf{y}_t \mid \mathbf{x}_0, \mathbf{z}_{1:t}, \mathbf{g}, \boldsymbol{\beta}) \quad (20)
where each $\mathbf{y}_t$ is assumed to only be dependent on the initial state and past transitions. Additionally, this is replaced with $p(\mathbf{y}_t \mid \mathbf{x}_t, \mathbf{g}, \boldsymbol{\beta})$, with $\mathbf{x}_t$ obtained using CVAE rollout as detailed previously. The right term in Eq. 19 is written as
p(\mathbf{x}_0, \mathbf{z}_{1:T}, \mathbf{g}, \boldsymbol{\beta}) = p(\mathbf{z}_{1:T}, \mathbf{x}_0) \, p(\mathbf{g}) \, p(\boldsymbol{\beta}) \quad (21)
= p(\mathbf{x}_0) \, p(\mathbf{g}) \, p(\boldsymbol{\beta}) \prod_{t=1}^{T} p_\theta(\mathbf{z}_t \mid \mathbf{x}_{t-1}) \quad (22)
where $\mathbf{g}$, $\boldsymbol{\beta}$, and the motion $(\mathbf{x}_0, \mathbf{z}_{1:T})$ are assumed to be mutually independent. We then use these results within Eq. 19 to optimize the log-likelihood:
\min_{\mathbf{x}_0, \mathbf{z}_{1:T}, \mathbf{g}, \boldsymbol{\beta}} \; -\log p(\mathbf{y}_{0:T} \mid \mathbf{x}_0, \mathbf{z}_{1:T}, \mathbf{g}, \boldsymbol{\beta}) - \log p(\mathbf{x}_0, \mathbf{z}_{1:T}, \mathbf{g}, \boldsymbol{\beta}) \quad (23)
= \min_{\mathbf{x}_0, \mathbf{z}_{1:T}, \mathbf{g}, \boldsymbol{\beta}} \; -\sum_{t=0}^{T} \log p(\mathbf{y}_t \mid \mathbf{x}_t, \mathbf{g}, \boldsymbol{\beta}) - \log p(\mathbf{x}_0) - \sum_{t=1}^{T} \log p_\theta(\mathbf{z}_t \mid \mathbf{x}_{t-1}) - \log p(\mathbf{g}) - \log p(\boldsymbol{\beta}) \quad (24)
Assuming each energy presented in the main paper can be written as the log-likelihood of a distribution, this formulation recovers our optimization objective besides two additional regularizers (these terms could, in principle, be written as part of a more complex motion prior term; however, for simplicity we do not do this). Next, we connect each energy term as presented in Sec. 4.2 of the paper to the probabilistic perspective.
Motion Prior
This term is already the log-likelihood of our HuMoR motion model (Eq. 11 of the paper), which exactly aligns with the MAP derivation.
Data Term
The form of the data term is modality-dependent. In the simplest case, the observations are 3D joints (or keypoints with known correspondences) and $p(\mathbf{y}_t \mid \mathbf{x}_t)$ is defined by a Gaussian noise model on the observed joints. Then the energy is as written in Eq. 12 of the paper. For other modalities (Eq. 13 and 14 in the paper), the data term can be seen as resulting from a more sophisticated noise model.
Ground Prior
We assume the ground should stay close to its initialization, so $p(\mathbf{g})$ is a Gaussian centered at the initialized floor, whose negative log-likelihood corresponds to the ground objective in the paper.
Shape Prior
The shape should stay near the neutral zero vector, so $p(\boldsymbol{\beta})$ is a zero-mean Gaussian, which gives the quadratic shape energy in the paper.
Appendix E Experimental Evaluation Details
In this section, we provide details of the experimental evaluations in Sec. 5 of the main paper.
E.1 Datasets
AMASS [61] We use the same processed AMASS dataset as described in Sec. B.2 for experiments. Experiments in Sec. 5.3 and 5.4 of the main paper use the held-out Transitions [61] and HumanEva [84] subsets, which together contain 4 subjects and about 19 minutes of motion.
i3DB [66] is a dataset of RGB videos captured at 30 Hz containing numerous person-environment interactions involving medium to heavy occlusions. It contains annotated 3D joint positions at 10 Hz along with a primitive cuboid 3D scene reconstruction. We run off-the-shelf 2D pose estimation (OpenPose) [15], person segmentation [17], and plane detection [53] models to obtain inputs and initialization for our test-time optimization. We evaluate our method in Sec. 5.5 of the main paper on 6 scenes (scenes 5, 7, 10, 11, 13, and 14) containing 2 people, which totals about 1800 evaluation frames. From the annotated 3D objects, we fit a ground plane which is used to compute plausibility metrics.
PROX [33] is a large-scale dataset of RGBD videos captured at 30 Hz containing person-scene interactions in a variety of environments with light to medium occlusions. We use a subset of the qualitative part of the dataset to evaluate the plausibility of our method’s estimations. The data does not have pose annotations, but does contain the scanned scene mesh, to which we fit a ground plane for plausibility evaluation. We obtain 2D pose, person masks, and ground plane initialization in the same way as for i3DB. We evaluate in Sec. 5.5 of the main paper on all videos from 4 chosen scenes (N3Office, N3Library, N0Sofa, and MPH1Library) that tend to have more dynamic motions and occlusions. In total, these scenes contain 12 unique people and about 19 minutes of video.
E.2 Baselines and Evaluation Metrics
Motion Prior Baselines
To be usable in our whole framework (test-time optimization with SMPL), the MVAE baseline is our proposed CVAE with all ablations applied simultaneously (no delta step prediction, no contact prediction, no SMPL losses, and no learned conditional prior). Note that this differs slightly from the model as presented in [52]: the decoder is an MLP rather than a mixture-of-experts, and the layer sizes are larger to provide the necessary representational capacity for training on AMASS. All ablations and MVAE are trained in exactly the same way as the full model. Additionally, when used in test-time optimization, we use the same energy weightings as described in Tab. 5 but with irrelevant energies removed (e.g., the No Contacts ablation does not allow the use of the contact-based energies). Note that the same objective is still used with MVAE and all ablations; the only thing that changes is the motion prior term.
Motion Estimation Baselines
The VPoser-t baseline is exactly the initialization phase of our proposed test-time optimization, using the weightings in Tab. 5.
The PROX-RGB baseline fits the neutral SMPL-X [70] body model to the same 2D OpenPose detections used by our method. Similar to our approach, it does not use face or hand keypoints for fitting. The PROX-D baseline uses the fittings provided with the PROX dataset, which use the known gendered SMPL-X body model and 2D face/hand keypoints for fitting.
The VIBE baseline uses the same 2D OpenPose detections as our method in order to define bounding boxes for inference. We found this makes for a fairer comparison, since the real-time trackers used in their implementation (see the VIBE GitHub repository) often fail for the medium to heavy occlusions common in our evaluation datasets.
Evaluation Metrics
In order to report occluded (Occ) and visible (Vis) positional errors separately, we must determine which joints/keypoints are occluded during evaluation. This is easily done for 3D tasks, where “occlusions” are synthetically generated. For RGB data in i3DB, we use the person segmentation mask obtained with DeepLabv3 [17] to determine if a ground truth 3D joint is visible after projecting it to the camera.
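This projection-and-lookup check can be sketched as a pinhole projection followed by a mask read. The function name and intrinsics handling are assumptions for illustration:

```python
import numpy as np

def joint_visible(joint_3d, K, mask):
    """Project a camera-frame 3D joint with intrinsics K and report
    whether it lands on the person segmentation mask (visible) or not
    (occluded). Points behind the camera or outside the image count
    as occluded."""
    x, y, z = joint_3d
    if z <= 0:
        return False  # behind the camera
    u = int(round(K[0, 0] * x / z + K[0, 2]))
    v = int(round(K[1, 1] * y / z + K[1, 2]))
    h, w = mask.shape
    if not (0 <= u < w and 0 <= v < h):
        return False  # projects outside the image
    return bool(mask[v, u])
```

Joints whose projection misses the person mask are counted toward the occluded (Occ) error metric.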
For a joint