1 Introduction
Recent years have witnessed significant development of markerless motion capture, which promotes a wide variety of applications ranging from character animation to human-computer interaction, personal well-being, and human behavior understanding. Extensive existing works kinematically capture accurate human pose from monocular videos and images via network regression [kanazawa2018end, kolotouros2019learning, kocabas2020vibe, zheng2019deephuman, zhang2020object] or optimization [pavlakos2019expressive, wang2017outdoor, rempe2021humor, fan2021revitalizing]. However, they are often hard to deploy in real-world systems due to a series of artifacts that violate biomechanical and physical plausibility (e.g., jitter and floor penetration).
To improve motion quality and physical plausibility, a few works capture human motion with physics-based constraints. [wei2010videomocap, shimada2020physcap, shimada2021neural, xie2021physics, rempe2020contact] incorporate physical laws as soft constraints in a numerical optimization framework to reduce artifacts. To keep the optimization tractable, they can only adopt simple and differentiable physical models, which may result in high approximation errors. Other methods [Yu:2021:MovingCam, yuan2021simpoe, peng2018sfv] utilize non-differentiable physics simulators with deep reinforcement learning (DRL) to achieve accurate and physically plausible 3D human pose estimation. However, training a desirable policy requires complex configurations [dulac2019challenges, li2017deep, andrychowicz2020matters], and the policy may be sensitive to environmental changes [peng2018deepmimic, Yu:2021:MovingCam]. These limitations make it infeasible for them to estimate human pose with scene interactions and subject varieties in motion capture tasks. Nevertheless, motion control techniques, typically sampling-based methods [liu2010sampling], have achieved impressive performance in reproducing highly dynamic and acrobatic motions and are robust to contact-rich scenarios, which suggests a path toward general physics-based motion capture.
In this paper, we aim to construct a physics-based motion capture framework built on sampling-based motion control that generalizes to complex terrains, shape variations, and diverse behaviors. However, employing sampling-based motion control in monocular motion capture faces several challenges. First, conventional sampling-based methods [liu2015improving, liu2010sampling] track accurate reference motion from commercial motion capture systems, while the motion estimated from monocular RGB videos is noisy and physically implausible. An inaccurate contact results in an unnatural pose and may even drive the character into an unbalanced state. Second, it is complicated to find an optimal distribution for the sampling. Although CMA (Covariance Matrix Adaptation) [hansen2006cma] has been proven able to adjust the distribution via black-box optimization [liu2015improving], it requires evaluating a large number of samples for the distribution adaptation, which is time-consuming. Furthermore, the adaptation relies on random samples from an initial distribution, which introduces uncertainty into the motion capture.
To address these obstacles, our key idea is to train a motion distribution prior with physical supervision. The prior provides feasible solutions for sampling-based motion control to capture physically plausible human motion from a monocular color video; we name this framework Neural Motion Control (Neural MoCon). We first introduce a human-scene interaction constraint to obtain a reference motion with appropriate contacts for sampling. Different from existing works [shimada2020physcap, rempe2020contact] that detect foot-ground contact status, our interaction constraint adjusts the distance between two disconnected meshes via a signed distance field (SDF), enforcing the human model to stay close to the ground surface.
We first tried training an encoder to regress the distribution with Kullback-Leibler (KL) divergence and pseudo ground-truth from CMA. However, for the same character state and reference pose, the CMA method obtains different distributions; this stochastic error of CMA results in network divergence and erroneous regression. Consequently, we propose a novel two-branch decoder to address this obstacle. As shown in Fig. 3, the target pose sampled from the estimated distribution is fed into a physical branch to verify its validity. Since the simulator is non-differentiable, we use the simulator output to supervise a pose decoder, enforcing it to map the target pose to a dynamical pose as the simulator does. Moreover, a reconstruction loss from the reference pose is applied to the decoded pose to promote correct distribution encoding. Once the encoder converges, we use it to encode the distribution and sample target poses for the physical branch to capture physically plausible motion. The main contributions of this work are summarized as follows.
We propose an explicit physics-based motion capture framework that generalizes to complex terrains, body shape variations, and diverse behaviors.

We propose a novel two-branch decoder to avoid the stochastic error of pseudo ground-truth and train the distribution prior with a non-differentiable physics simulator.

We propose an interaction constraint based on SDF to capture accurate human-scene contact in complex terrain scenarios.
2 Related Work
Physics-based motion capture. VideoMocap [wei2010videomocap] first employs physical constraints in motion capture by jointly optimizing the human pose and contact force, but it requires manual intervention to achieve satisfying results. Based on [wei2010videomocap], [li2019estimating] and [shimada2020physcap, rempe2020contact, zell2017joint] further consider object interaction and kinematic pose estimation, respectively. Recently, Shimada et al. [shimada2021neural]
proposed a neural network-based approach to estimate the ground reaction force and joint force, updating the character's pose with the derived accelerations. To keep the optimization tractable, these methods can only adopt simple and differentiable physics models with limited constraints, which results in high approximation errors. To address this problem, some recent works
[peng2018sfv, yuan2021simpoe, yuan2020residual, Yu:2021:MovingCam] employ DRL to implement motion capture on top of non-differentiable simulators. Nevertheless, training a desirable policy requires complex configurations [dulac2019challenges, li2017deep, andrychowicz2020matters], and the policy may be sensitive to motion types and body shape variations [peng2018deepmimic, Yu:2021:MovingCam]. Vondrak et al. [vondrak2012video] directly used the silhouette to construct a character-image consistency to train a state-machine controller. However, this approach can hardly generalize to a variety of motions, and the recovered motion appears unnatural. In this work, we adopt neural motion control rather than DRL to capture motion. With the trained distribution prior, our method generalizes better to different terrain interactions, human shape variations, and diverse behaviors.
Physics-based character control. Physics-based character control is a long-standing problem [van1993sensor, wrotek2006dynamo, sharon2005synthesis, lee2014locomotion, lee2021learning, xiang2010physics]. Early works relying on the inverted pendulum model [kajita1991study], passive dynamic walking [kuo2001simple], and zero-moment-point-based trajectory generation [harada2006analytical] can only handle simple motions. To handle large-DOF (degree-of-freedom) models, optimization-based methods
[kim2008dynamic, xiang2009optimization, sok2007simulating, levine2012physically] are widely used to simulate and analyze human motions. However, they require substantial computational effort for complex motions. Other methods [ye2010optimal, coros2010generalized] approximate the actual human control system and can produce both normal and pathological walking motions. These control-based methods generalize to a variety of skills [coros2010generalized, ye2010optimal, liu2010sampling, liu2015improving, liu2012terrain], but a set of hyperparameters must be tuned for the desired behaviors. Recent works adopt DRL to control physical characters [peng2018deepmimic, xie2020allsteps, lee2021learning]. DRL can achieve high-quality motion when motion capture data are provided as a reference [peng2018deepmimic], and curriculum learning enables DRL to learn more complex tasks [xie2020allsteps]. However, training an optimal policy involves numerous low- and high-level design decisions, which strongly affect the performance of the resulting agents. We follow sampling-based motion control [liu2010sampling, liu2015improving] to construct a general framework. Furthermore, we propose a network-based distribution prior to avoid the time-consuming distribution adaptation and to improve the stability of these methods.
3D human with scene interaction. Modeling 3D humans with scene interactions promotes the computational understanding of human behavior, which is important for the metaverse and related applications. Previous works in scene labeling [jiang2013hallucinated], scene synthesis [fisher2015activity], affordance learning [grabner2011makes, kim2014shape2pose] and object arrangement [jiang2012learning] verified that human context is helpful for scene understanding. Prior knowledge of scene geometry can also promote more reasonable and accurate human pose estimation.
[savva2016pigraphs, savva2014scenegrok, hassan2021populating, hassan2021stochastic] generate human motion with interactions from the relationship between scene geometry and human body parts. [monszpart2019imapper] further utilizes this relationship to recover interactions from videos. To explicitly use scene information to improve pose accuracy, [hassan2019resolving] formulates two constraints in optimization to reduce interpenetration and encourage appropriate contact. [zhang2021learning] also adopts an optimization-based approach and proposes a smoothness prior to improve motion quality. However, numerical optimization with soft constraints can hardly avoid artifacts like interpenetration, which is the main concern for human-scene reconstruction. In contrast, our method relies on a physics simulator [coumans2021] to provide hard physical constraints. With the network-based distribution prior, our method obtains accurate terrain interactions via neural motion control.
3 Method
We propose a framework with a non-differentiable physics simulator [coumans2021] to capture physically plausible human motion. We first describe the representations of our kinematic and dynamical characters (Sec. 3.1). Then, an interaction constraint is designed to obtain reference motion with appropriate contact information (Sec. 3.2). In addition, we introduce a distribution prior trained with a novel two-branch structure for neural motion control (Sec. 3.3). Finally, we regress a distribution and sample satisfactory target poses to track the estimated reference motion (Sec. 3.4).
3.1 Preliminaries
Representation. The kinematic motion is represented with the SMPL model [loper2015smpl]. To represent different human shapes in the physics simulator, we design our physical character to have the same kinematic tree as SMPL. The bone length and link shape of the character can be directly obtained from the estimated SMPL parameters. We fix a few skeleton joints so that the character has 57 DOFs. The state of the character is denoted as $s = (q, \dot{q})$, where $q$ and $\dot{q}$ are the pose and velocity, respectively. The details of the model can be found in the supplementary material.
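The state representation above can be sketched as follows. This is a minimal illustration; the flat vector layout and the helper names are assumptions, not the paper's actual data structures, and only the 57-DOF count comes from the text:

```python
import numpy as np

NUM_DOF = 57  # DOFs of the character after fixing a few skeleton joints

def pack_state(q: np.ndarray, q_dot: np.ndarray) -> np.ndarray:
    # Character state s = (q, q_dot): pose and velocity stacked into one vector.
    return np.concatenate([q, q_dot])

def unpack_state(s: np.ndarray):
    # Split a state vector back into pose and velocity.
    return s[:NUM_DOF], s[NUM_DOF:]
```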
Sampling-based motion control. We briefly review the sampling-based motion control approach [liu2010sampling] to aid the understanding of our method. A kinematic pose is used as a reference, and we wish the physical character to dynamically track the reference pose via PD (proportional-derivative) control. However, due to the inaccuracies of the kinematic pose estimation and the PD controller, tracking always fails when the reference pose is directly applied as the desired setpoint. The sampling algorithm instead samples a correction $\Delta q$ for the reference pose $\hat{q}$, so that employing the target pose $\bar{q} = \hat{q} + \Delta q$ compensates for the discrepancies. The quality of each sample is evaluated by a loss function; by selecting the sample with the lowest loss, we obtain physically plausible motion. More details can be found in [liu2010sampling].
3.2 Reference motion estimation
Neural motion control requires reference motion with accurate ground contact to drive the physical character. To obtain the contact information, previous works [shimada2020physcap, Yu:2021:MovingCam] train a network to estimate a binary foot contact status. However, sufficient training data are unavailable for complex terrain scenarios (e.g., stairs and uneven ground). We address this problem by incorporating an SDF-based interaction constraint into an optimization-based framework.
Specifically, we optimize the latent code of the pre-trained motion prior in [huang2021dynamic] to fit SMPL models to single-view 2D poses detected by AlphaPose [fang2017rmpe]. The overall formulation is:
$$\min_{z, R, T, \beta}\; E_{data} + \lambda_{reg} E_{reg} + \lambda_{int} E_{int}, \qquad (1)$$
where $z$, $R$, $T$ are the latent code, global rotation, and translation of the character in each frame, $\beta$ is the human shape parameter, and $N$ is the frame length. The data term is:
$$E_{data} = \sum_{t=1}^{N} \sum_{j} c_{t,j} \left\lVert \Pi\!\left(J_{t,j}\right) - x_{t,j} \right\rVert^2, \qquad (2)$$
where $x_{t,j}$, $c_{t,j}$ are the detected 2D poses and their corresponding confidences, $J_{t,j}$ is the model joint position, and $\Pi(\cdot)$ denotes the camera projection. We further add the regularization term:
(3) 
Due to the depth ambiguity, the recovered 3D human may float in the air or penetrate the ground mesh with only the above constraints. With such reference motion, the simulated results are unnatural and incorrect. To reconstruct more accurate human-scene interactions from single-view videos, we generate a differentiable SDF of the scene mesh using [jiang2020coherent]. In the optimization, we follow [hassan2019resolving] to sample the SDF value at predefined foot keypoints and use it to construct an objective function:
$$E_{int} = \sum_{k} \rho\!\left( S\!\left( p_k \right) \right), \qquad (4)$$
where $p_k$ is the 3D position of foot keypoint $k$ and $S(\cdot)$ is the sampling operation on the SDF. Our optimization has four stages. Since a proximate motion can be obtained in the first three stages, we only apply the interaction term to refine the ground contact in the last stage. To make our method compliant with airborne motions, we further apply a Geman-McClure error function $\rho(\cdot)$ [geman1987statistical] to down-weight keypoints that are far from the scene mesh.
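A runnable sketch of this interaction term, assuming for illustration a flat ground at z = 0 (so the SDF of a point reduces to its height) and a hypothetical Geman-McClure scale `sigma`; the paper instead samples a differentiable SDF generated from the scene mesh:

```python
import numpy as np

def sdf_flat_ground(points: np.ndarray) -> np.ndarray:
    # Stand-in scene SDF: signed distance of each point to a ground plane at z = 0.
    return points[:, 2]

def geman_mcclure(x: np.ndarray, sigma: float = 0.2) -> np.ndarray:
    # Robust error: quadratic near zero, saturates at 1 for large |x|,
    # so keypoints far from the scene (airborne motion) are down-weighted.
    return x ** 2 / (x ** 2 + sigma ** 2)

def interaction_term(foot_keypoints: np.ndarray, sdf=sdf_flat_ground) -> float:
    # E_int: pull predefined foot keypoints toward the zero level set of the SDF.
    return float(geman_mcclure(sdf(foot_keypoints)).sum())
```

Keypoints resting on the plane give a near-zero penalty, while floating feet are penalized up to the saturation bound of the robustifier.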
3.3 Distribution prior training
It is essential to find an optimal target pose distribution for sampling-based motion control to achieve physically plausible motion. Previous works [liu2015improving] use the CMA-ES method [hansen2006cma] for distribution adaptation. However, the time-consuming operation and stochastic error of the adaptation make it hard to leverage in motion capture for real-world applications. We propose to replace this operation and improve performance with a network-based distribution prior. To train the network, a naive idea is to directly supervise the distribution with the CMA results. Given a pair of character state and reference pose, it seems that we could provide supervision by running CMA online before feeding the data into the network, or by preparing pseudo supervision with CMA in advance. In practice, both strategies are infeasible. For the same character state and reference pose, the CMA method obtains different distributions, resulting in network divergence and erroneous regression for the online and offline strategies, respectively.
To solve this obstacle, we propose a two-branch decoder to assist in training an accurate and generalized distribution encoder. As shown in Fig. 3, we first pre-train the distribution encoder with supervision from offline CMA. Since the network parameters trained with this inaccurate supervision are incorrect, we then introduce a physical branch to verify the validity of the sampled target pose. Due to the non-differentiability of the simulator, we further design a pose decoder to indirectly employ the physical supervision to train the distribution encoder.
Specifically, the KL divergence with pseudo ground-truth distributions is used to pre-train the encoder:
$$L_{KL} = D_{KL}\!\left( \hat{\mathcal{N}}(\mu, \Sigma) \,\big\|\, \mathcal{N}^{*} \right), \qquad (5)$$
where $\mathcal{N}^{*}$ is the distribution prepared by the CMA-ES method and $\hat{\mathcal{N}}(\mu, \Sigma)$ is the estimated distribution. To improve the generalization ability, we sample a correction of the reference pose from the estimated distribution, denoted as $\Delta q \sim \hat{\mathcal{N}}(\mu, \Sigma)$. Thus, the target pose is $\bar{q} = \hat{q} + \Delta q$.
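Assuming a diagonal Gaussian for illustration (the encoder may predict a fuller covariance), the sampling step can be sketched as:

```python
import numpy as np

def sample_target_pose(q_ref, mu, sigma, rng):
    # Draw a correction dq ~ N(mu, diag(sigma^2)) from the encoded distribution
    # and add it to the reference pose to obtain the target pose.
    dq = mu + sigma * rng.standard_normal(mu.shape)
    return q_ref + dq

rng = np.random.default_rng(0)
q_ref = np.zeros(51)  # the target pose parameterization is 51-D (see Appendix)
q_target = sample_target_pose(q_ref, mu=np.zeros(51), sigma=0.05 * np.ones(51), rng=rng)
```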
To optimize the distribution encoder with real physical supervision, the sampled target pose is fed to the non-differentiable physics simulator to obtain the simulated pose. We design a pose decoder to imitate the physical branch by supervising it with the simulated pose:
$$L_{dec} = \left\lVert q^{dec} - \tilde{q} \right\rVert^2 + \left\lVert J^{dec} - \tilde{J} \right\rVert^2, \qquad (6)$$
where $q^{dec}$, $J^{dec}$ and $\tilde{q}$, $\tilde{J}$ are the poses and joint positions of the estimated result and the simulated result, respectively. In addition, a reconstruction loss is applied to enforce optimal distribution encoding:
$$L_{rec} = \left\lVert q^{dec} - \hat{q} \right\rVert^2. \qquad (7)$$
With the pose decoder, the encoder gradually learns to encode valid distributions that yield effective poses in the simulator. We further add a regularization term to ensure the network does not easily overfit:
(8) 
We reduce the weight of the KL loss when training with the two-branch decoder. The overall loss function is:
$$L = L_{dec} + L_{rec} + L_{reg} + \lambda_{KL} L_{KL}. \qquad (9)$$
The $\lambda_{KL}$ is 0.2 in our experiments. When training is finished, the encoder is utilized to construct the neural motion capture framework in Sec. 3.4.
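The loss composition described above can be sketched as follows, with the encoder, decoder, and simulator outputs passed in as plain arrays. The closed-form diagonal-Gaussian KL is a simplifying assumption, and the regularization term is omitted because its exact form is not specified here:

```python
import numpy as np

def kl_diag_gauss(mu_q, sig_q, mu_p, sig_p):
    # KL( N(mu_q, diag(sig_q^2)) || N(mu_p, diag(sig_p^2)) ), summed over dims.
    return float(np.sum(np.log(sig_p / sig_q)
                        + (sig_q**2 + (mu_q - mu_p)**2) / (2.0 * sig_p**2) - 0.5))

def total_loss(q_dec, J_dec, q_sim, J_sim, q_ref, mu, sig, mu_cma, sig_cma, lam_kl=0.2):
    l_dec = np.sum((q_dec - q_sim)**2) + np.sum((J_dec - J_sim)**2)  # imitate simulator
    l_rec = np.sum((q_dec - q_ref)**2)                               # reconstruction
    l_kl = kl_diag_gauss(mu, sig, mu_cma, sig_cma)                   # pseudo GT from CMA
    return float(l_dec + l_rec + lam_kl * l_kl)
```

With a perfect decoder and a distribution matching the CMA pseudo ground-truth, every term vanishes; in training, only the down-weighted KL term (weight 0.2) keeps the encoder near the pre-training solution.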
3.4 Motion capture with neural motion control
With the trained distribution prior, we capture human motion by tracking the kinematic reference motion with a sampling strategy. As shown in Fig. 2, the reference pose and the current state of the character are first fed into the prior to encode the target pose distribution. Then, we sample target poses and simulate them in the simulator. The quality of each sample is evaluated with character-level and image-level loss functions, and the sample with the lowest loss is adopted for the next frame. Since reference motions from uneven terrains are noisy, we design several loss functions to evaluate sample quality.
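During simulation, the character is driven toward each sampled target pose via PD control (Sec. 3.1). A generic sketch of that rule; the per-joint gains are hypothetical here (the actual values are listed in the supplementary), and in practice a simulator's built-in motor control can play this role:

```python
import numpy as np

def pd_torque(q, q_dot, q_target, kp, kd):
    # Proportional-derivative control: push each joint toward its target angle
    # while damping the current joint velocity.
    return kp * (q_target - q) - kd * q_dot
```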
The loss between the simulated pose and the reference pose first measures pose and joint position consistency:
$$L_{pose} = \left\lVert \tilde{q} - \hat{q} \right\rVert^2 + \left\lVert \tilde{J} - \hat{J} \right\rVert^2. \qquad (10)$$
We find that the dynamical state of the character is critical for physics-based motion capture. We therefore introduce a dynamical loss to evaluate velocity consistency:
$$L_{vel} = \left\lVert \tilde{\omega} - \hat{\omega} \right\rVert^2 + \left\lVert \tilde{v} - \hat{v} \right\rVert^2, \qquad (11)$$
where $\omega$ and $v$ are the joint angular velocity and linear velocity, respectively. To keep the physical character balanced, we follow [liu2010sampling] and add a balance term to adjust the CoM (center of mass):
$$L_{bal} = \frac{1}{N_e} \sum_{i=1}^{N_e} \left\lVert \tilde{d}_i - \hat{d}_i \right\rVert^2 + \left\lVert \tilde{v}^{com} - \hat{v}^{com} \right\rVert^2, \qquad (12)$$
where $d_i$ denotes the planar vector from end-effector $i$ to the CoM, $v^{com}$ is the linear velocity of the CoM, and $N_e$ is the number of end-effectors. Different from DRL, we can directly use image features to evaluate the quality of each sample. With the 2D pose and corresponding confidence, the image-level loss makes our method more robust to occlusion scenarios:
$$L_{img} = \sum_{j} c_{j} \left\lVert \Pi\!\left(\tilde{J}_{j}\right) - x_{j} \right\rVert^2. \qquad (13)$$
The overall loss function for the sampling procedure is:
$$L = L_{pose} + w_{vel} L_{vel} + w_{bal} L_{bal} + w_{img} L_{img}. \qquad (14)$$
Finally, the samples with the lowest loss in each frame constitute a complete, physically plausible human motion.
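The sample-evaluate-select procedure can be sketched as follows, with `simulate` standing in for one simulator rollout and `losses`/`weights` for the weighted terms above; all names are illustrative. The full method draws many samples per frame and keeps several low-loss states for the next frame, while this sketch keeps only the single best:

```python
import numpy as np

def select_best_sample(state, target_poses, simulate, losses, weights):
    # Roll out every sampled target pose from the current state, score the
    # result with the weighted loss terms, and keep the lowest-loss sample.
    best_state, best_loss = None, np.inf
    for q_bar in target_poses:
        next_state = simulate(state, q_bar)
        total = sum(w * loss(next_state) for w, loss in zip(weights, losses))
        if total < best_loss:
            best_state, best_loss = next_state, total
    return best_state, best_loss
```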
4 Experiments
In this section, we conduct several qualitative and quantitative experiments to demonstrate the effectiveness of our method. We first introduce the evaluation metrics and datasets in Sec. 4.1 and Sec. 4.2. Then, comparisons with the state of the art are shown in Sec. 4.3. Finally, ablation studies in Sec. 4.4 evaluate the key components.
4.1 Metrics
The common metrics of Mean Per Joint Position Error (MPJPE) and MPJPE after rigid alignment of the prediction with the ground truth via Procrustes Analysis (PA-MPJPE) are used to evaluate joint accuracy. To evaluate physical plausibility, we use the metrics proposed in [shimada2020physcap] and [xie2021physics] to measure motion jitter and foot contact. $e_{smooth}$ is the difference in joint velocity magnitude between the ground-truth motion and the predicted motion; $e_{smooth}$ and its standard deviation $\sigma$ are used to assess motion smoothness. $e_{foot}$ is the foot position error along the z-axis, which we adopt to evaluate foot floating artifacts. More details can be found in the original papers.
4.2 Datasets
Human3.6M [h36m_pami] is a large-scale dataset consisting of 3.6 million 3D human poses and corresponding images. Following previous work [yuan2021simpoe], we train our model on 5 subjects (S1, S5, S6, S7, S8) and test on the remaining subjects (S9, S11) at 25 Hz.
GPA [wang2019geometric] is a 3D human dataset with both human-scene interactions and ground-truth scene geometries, collected with a commercial motion capture system. Sequences 0, 34, and 52 are used for testing, and the rest serve as training data. With the scene geometries, we verify the performance of our method on more complex terrains.
3DOH [zhang2020object] is the first dataset to address object-occluded human body estimation and contains 3D motions in occluded scenarios. We use sequences 13, 27, and 29 to evaluate our method on occlusion cases.
GTA-IM [cao2020long]. Since ground-truth terrain data are limited, we use this synthetic dataset for additional human-scene interaction cases. The scene meshes are recovered from the depth maps. We conduct qualitative experiments on this dataset.




Table 1. Comparison on Human3.6M.

Method | MPJPE | PA-MPJPE | $e_{smooth}$ | $\sigma$ | $e_{foot}$
HuMoR [rempe2021humor] | 97.5 | 68.5 | 24.2 | 25.9 | 43.2
DMMR [huang2021dynamic] | 96.0 | 67.4 | 14.4 | 12.6 | 48.6
VIBE [kocabas2020vibe] | 65.9 | 41.5 | 25.5 | 25.7 | 34.0
EgoPose [yuan2019ego] | 130.3 | 79.2 | – | – | –
PhysCap [shimada2020physcap] | 97.4 | 65.1 | 7.2 | 6.9 | –
SamCon [liu2015improving] | 78.4 | 63.2 | 4.0 | 4.3 | 20.4
NeuralPhysCap [shimada2021neural] | 76.5 | 58.2 | 4.5 | 6.9 | –
Xie et al. [xie2021physics] | 68.1 | – | 4.0 | 1.3 | 18.9
SimPoE [yuan2021simpoe] | 56.7 | 41.6 | – | – | –
Ours | 72.5 | 54.6 | 3.8 | 2.4 | 14.4





Table 2. Comparison on 3DOH and GPA.

Method | 3DOH MPJPE | 3DOH PA-MPJPE | 3DOH $e_{foot}$ | GPA MPJPE | GPA PA-MPJPE | GPA $e_{foot}$
DMMR [huang2021dynamic] | 102.9 | 65.8 | 16.2 | 107.0 | 87.4 | 32.8
VIBE [kocabas2020vibe] | 98.1 | 61.8 | 26.5 | 114.3 | 80.6 | 36.4
HuMoR [rempe2021humor] | 105.1 | 60.6 | 21.9 | 117.2 | 86.3 | 58.7
SamCon [liu2015improving] | 102.4 | 95.4 | 9.7 | 104.7 | 87.1 | 28.3
PhysCap [shimada2020physcap] | 107.8 | 93.3 | 12.2 | 103.4 | 91.2 | 36.1
Ours | 93.4 | 86.7 | 9.2 | 94.8 | 80.3 | 21.2

4.3 Comparison to state-of-the-art methods
Several kinematic and dynamical approaches report results on the Human3.6M dataset. As shown in Tab. 1, we first evaluate our method on this dataset to demonstrate that neural motion control works well on flat ground. [kocabas2020vibe, huang2021dynamic, rempe2021humor] are recent works that estimate kinematic SMPL parameters. Although the explicit dynamics of the human model are not considered, [huang2021dynamic, rempe2021humor] learn implicit dynamics via a VAE and improve physical plausibility with prior knowledge. The remaining methods in Tab. 1 are dynamics-based. Specifically, SamCon [liu2015improving] is designed for animation; we used it to track our kinematic motion and adopted it as a baseline for comparison among sampling-based methods.
In Tab. 1, VIBE achieves the best performance in terms of PA-MPJPE. It relies on a GRU-based network to build correspondences among different frames. However, directly regressing kinematic SMPL parameters causes the largest smoothness error and results in visually noticeable motion jitter. Furthermore, VIBE shows severe penetration with the ground in Fig. 4. Due to model discrepancies between the motion capture subject and the physical character, the joint position error of dynamics-based methods is higher than that of kinematics-based approaches. SimPoE [yuan2021simpoe] utilizes a model with a shape similar to the Human3.6M subjects and obtains results comparable to VIBE. However, for different subjects with varying body proportions and shapes, this method requires retraining the policy. Benefiting from the proposed target pose distribution prior, our method adapts to shape variation: we update the bone lengths of the physical character with the estimated human shape and directly use it to capture human motion from images. Our method also obtains smooth motion and achieves state-of-the-art results in terms of $e_{smooth}$ and $e_{foot}$.
We then compared our method to others on the 3DOH dataset. It is difficult to obtain accurate reference motion for occlusion cases. As shown in the 5th column of Fig. 4, inaccurate reference motion results in a large deviation between the 3D pose and the image observation for other physics-based methods. Due to the image-level loss, however, our method obtains more accurate results. Moreover, SamCon is also based on a sampling approach to obtain human motion. The results in Tab. 2 and Fig. 4 show that our network-based distribution prior obtains a more appropriate distribution and thus produces natural and precise motion.
On the GPA dataset, we evaluated our method on complex terrains. Interactions with objects and terrains impose great difficulty for kinematics-based methods: their estimated poses float in the air or penetrate the scene mesh (Fig. 4). Since PhysCap uses a numerical optimization framework with soft physical constraints to capture human motion, its results also show physical artifacts. The qualitative and quantitative results on GPA in Fig. 4 and Tab. 2 show that neural motion control is better suited to contact-rich scenarios.
4.4 Ablation studies
Two-branch decoder. As mentioned before, directly supervising the distribution encoder without the two-branch decoder results in erroneous regression. In Fig. 6 and Tab. 3, we compare the distribution prior trained with and without the two-branch decoder. Without the decoder, the encoder cannot regress a correct distribution from which to sample a valid target pose, causing unsatisfactory simulated poses. The quantitative results in Tab. 3 show that the two-branch decoder brings the largest improvement, demonstrating that it is the most important component of our method.
Distribution prior. We compared different methods of distribution generation to verify the superiority of our distribution prior. We first replaced the distribution encoder with a uniform distribution over a predefined range. The results in Fig. 6 show that it cannot generalize to a large variety of motion types. As shown in Tab. 3, since the CMA method has stochastic error, the Gaussian distribution with CMA adaptation is inferior to the distribution encoder.
Interaction constraint. We further conducted several experiments to illustrate the necessity of the interaction constraint. Due to visual ambiguity, it is difficult to reconstruct accurate human-scene interactions on complex terrains (Fig. 7). In Tab. 3, optimization with the interaction constraint yields more accurate foot positions on the GPA dataset. In addition, inaccurate contact seriously affects the performance of sampling-based motion control: Fig. 6 (a) shows that a reference pose floating in the air triggers an improper simulated pose. The gap between the results with and without this constraint is greater on GPA than on 3DOH in Tab. 3, which proves its importance for motion capture on complex terrains.




Table 3. Ablation studies on GPA and 3DOH.

Method | GPA MPJPE | GPA PA-MPJPE | GPA $e_{foot}$ | 3DOH MPJPE | 3DOH $e_{foot}$
DMMR [huang2021dynamic] | 107.0 | 87.4 | 32.8 | 102.9 | 16.2
Kinematic | 106.2 | 87.2 | 27.3 | 94.4 | 16.5
w/o two-branch | 142.2 | 126.7 | 28.4 | 136.8 | 13.2
w/ Uniform Dist. | 136.6 | 119.1 | 29.6 | 142.1 | 10.3
w/o Inter. Cons. | 116.4 | 109.4 | 24.3 | 93.4 | 9.4
w/ Gaussian CMA | 103.9 | 84.4 | 23.5 | 95.4 | 9.8
w/o image-level loss | 95.8 | 84.4 | 21.3 | 96.3 | 9.7
w/ GT reference | 93.6 | 80.0 | 17.3 | 89.6 | 9.2
Neural MoCon | 94.8 | 80.3 | 21.2 | 93.4 | 9.2

5 Limitation and future work
Although our method obtains physically plausible human motion via neural motion control, the current implementation has several limitations. First, the discrepancy between the geometric primitives of our character and the real human body prevents our method from reconstructing accurate body contact (e.g., lying on a sofa). Building a more delicate character model like [yuan2021simpoe] may be a feasible solution. Second, the cumulative error of an undesirable sample may cause failure when sampling a long sequence. Future work could integrate long-term temporal information into the sampling. Finally, due to the lack of ground-truth terrain data, we can only evaluate our method on similar interactions, such as stairs, for motion capture tasks. Building a large-scale human-scene interaction dataset for human motion capture in complex scenarios is therefore also worthwhile.
Among Neural MoCon, DRL-based methods, and traditional sampling-based motion control, DRL obtains highly accurate results for a specific task, while sampling control generalizes better to unknown scenarios. Neural MoCon sits between these two typical technical approaches. Combining the accuracy of DRL with the generalization ability of sampling control may be a promising direction for future physics-based motion capture.
6 Conclusion
In this paper, we propose a framework to capture physically plausible human motion with complex terrain interactions, human shape variations, and diverse behaviors. We first introduce an SDF-based interaction constraint in optimization to estimate accurate human-scene contact. Then, a novel two-branch decoder is designed to train a distribution prior with real physical supervision. With the trained prior and the estimated reference motion, several loss functions are used to select satisfactory samples that constitute a complete human motion. The proposed method has better generalization ability than DRL-based methods and obtains more accurate results than conventional sampling-based motion control.
References
Supplementary Material
Appendix A Implementation details
We adopt PyBullet [coumans2021] as the simulator. The control frequency is 240 Hz, and the coefficient of friction is 0.9. Since the frame rate varies among videos, we apply linear interpolation to the estimated motion between two frames to obtain the reference pose and velocity. The sampling frequency is 30 Hz. To train the distribution prior, we implement the neural network in PyTorch [paszke2019pytorch]. The distribution encoder and the pose decoder have six and four fully-connected layers, respectively, with batch normalization and the LeakyReLU [maas2013rectifier] activation function. The AdamW [loshchilov2017decoupled] optimizer with a learning rate of 0.0001 is used to train the network. On a desktop with an Intel(R) Core(TM) i9-11900F CPU and an NVIDIA GeForce RTX 3090 GPU, one sample takes about 0.0002 s without any implementation acceleration strategy. We draw 1000 samples for each target pose and save 20 samples as start states for the next target pose.
A.1 Physical character creation
In this section, we explain the details of physical character creation under different body shape variations. To represent the kinematic and dynamical models in a unified framework, we design the kinematic tree of the physical character to be the same as SMPL [loper2015smpl]. According to the estimated SMPL shape parameters, we automatically generate a new character. With the joint regressor in SMPL, we obtain the length of each bone of the estimated SMPL model in T-pose. Since the bones in symmetrical parts have minor differences, we average their lengths, correct the rotation for each bone, and build the skeleton based on the parent-child relationship. We then determine the link shapes from the created skeleton. Physical characters with different shapes are shown in Fig. 8. In addition, we do not control the hands and feet, so these joints are fixed. Since the control parameters are dramatically affected by mass, all characters in our experiments have the same mass. The details of the character model are described in Tab. 4.
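The bone-length computation and left/right averaging can be sketched as follows, with toy joint names; the real implementation works on the SMPL joint regressor output in T-pose:

```python
import numpy as np

def bone_lengths(joints, parents):
    # Bone length = distance from each joint to its parent (root excluded).
    return {j: float(np.linalg.norm(joints[j] - joints[p])) for j, p in parents.items()}

def symmetrize(lengths, pairs):
    # Average the lengths of symmetric (left/right) bones so both sides match.
    out = dict(lengths)
    for left, right in pairs:
        out[left] = out[right] = 0.5 * (lengths[left] + lengths[right])
    return out
```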
A.2 CMA-ES
CMA-ES (covariance matrix adaptation evolution strategy) [hansen2006cma] is a black-box optimization method. We use the implementation of [Hansen16a] to prepare pseudo ground truth for distribution prior training. The mean and the variance have the same dimension as the target pose, which is 51. The maximum number of resamplings is 100, and the population size is 6 in our experiments. The distribution evolves for 30 generations for a given character state and reference pose. To obtain more natural motion, we limit the sampling bounds as shown in Tab. 6.
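The ask-rank-tell loop behind this procedure can be illustrated with a deliberately simplified evolution strategy. This is only a sketch under our own naming: it adapts a diagonal spread rather than the full covariance matrix of CMA-ES, and the actual experiments use the pycma implementation of [Hansen16a].

```python
import numpy as np

def simple_es(cost, x0, sigma0, popsize=6, elites=3, generations=30,
              bounds=None, rng=None):
    """Greatly simplified evolution strategy: sample a population, rank by
    cost, and refit the mean and (diagonal) spread to the elite samples.
    Real CMA-ES additionally adapts a full covariance matrix."""
    rng = np.random.default_rng(0) if rng is None else rng
    mean = np.asarray(x0, dtype=float)
    sigma = np.full(mean.shape, float(sigma0))
    for _ in range(generations):
        pop = mean + sigma * rng.standard_normal((popsize, mean.size))
        if bounds is not None:                      # sampling bounds (cf. Tab. 6)
            pop = np.clip(pop, bounds[0], bounds[1])
        order = np.argsort([cost(x) for x in pop])  # rank by tracking cost
        elite = pop[order[:elites]]
        mean = elite.mean(axis=0)                   # refit the distribution
        sigma = elite.std(axis=0) + 1e-8
    return mean, sigma  # pseudo-ground-truth distribution for prior training
```

The defaults mirror the settings above (population size 6, 30 generations); the returned mean and spread play the role of the pseudo-ground-truth target pose distribution.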



Joint | Type | Geometry | Mass (kg) | Num | Kp | Kd | Force Limit
Lower Neck | revolute | capsule | 0.5 | 1 | 200 | 20 | 100
Upper Neck | revolute | capsule | 3.0 | 1 | 200 | 20 | 100
Chest | revolute | sphere | 8.0 | 1 | 500 | 50 | 300
Lower Back | revolute | sphere | 5.0 | 1 | 500 | 50 | 300
Upper Back | revolute | sphere | 5.0 | 1 | 500 | 50 | 300
Clavicle | revolute | capsule | 1.0 | 2 | 400 | 40 | 200
Shoulder | revolute | box | 2.0 | 2 | 400 | 40 | 200
Elbow | revolute | box | 1.0 | 2 | 300 | 30 | 150
Wrist | fixed | sphere | 0.5 | 2 | – | – | –
Hip | revolute | capsule | 5.0 | 2 | 500 | 50 | 300
Knee | revolute | capsule | 3.0 | 2 | 400 | 40 | 200
Ankle | revolute | box | 1.0 | 2 | 300 | 30 | 100
All links share the same inertia: Inertia(xx) = Inertia(yy) = Inertia(zz) = 0.001 and Inertia(xy) = Inertia(xz) = Inertia(yz) = 0.0.




Character Property | Value | Simulator Property | Value
Joints | 19 | Gravity | 9.81
Movable Joints | 17 | Time Step | 1/240.0
Fixed Joints | 2 | NumSolverIterations | 10
Links | 19 | NumSubSteps | 2
Total Mass (kg) | 53.5 | |
Degrees of Freedom | 57 | |
Lateral Friction Coefficient | 0.9 | |
Rolling Friction Coefficient | 0.3 | |
Restitution Coefficient | 0.0 | |




Joint | -x | +x | -y | +y | -z | +z
Left Hip | -2.0 | 2.0 | -0.57 | 0.57 | -0.27 | 0.27
Left Knee | -0.3 | 1.57 | -0.27 | 0.27 | 0.0 | 0.0
Left Ankle | -0.57 | 0.57 | -0.57 | 1.2 | -0.57 | 0.57
Right Hip | -2.0 | 2.0 | -0.57 | 0.57 | -0.27 | 0.27
Right Knee | -0.3 | 1.57 | -0.27 | 0.27 | 0.0 | 0.0
Right Ankle | -0.57 | 0.57 | -1.2 | 0.57 | -0.57 | 0.57
Lower Back | -1.57 | 1.57 | -1.57 | 1.57 | -1.57 | 1.57
Upper Back | -1.57 | 1.57 | -1.57 | 1.57 | -1.57 | 1.57
Chest | -1.57 | 1.57 | -1.57 | 1.57 | -1.57 | 1.57
Lower Neck | -0.57 | 0.57 | 0.0 | 0.0 | 0.0 | 0.0
Upper Neck | -0.57 | 0.57 | -0.57 | 0.57 | 0.0 | 0.0
Left Clavicle | -1.57 | 1.57 | -1.57 | 1.57 | -1.57 | 1.57
Left Shoulder | -1.57 | 1.57 | -1.57 | 1.57 | -1.57 | 1.57
Left Elbow | -1.57 | 1.57 | -1.57 | 1.57 | -1.57 | 1.57
Right Clavicle | -1.57 | 1.57 | -1.57 | 1.57 | -1.57 | 1.57
Right Shoulder | -1.57 | 1.57 | -1.57 | 1.57 | -1.57 | 1.57
Right Elbow | -1.57 | 1.57 | -1.57 | 1.57 | -1.57 | 1.57




Stage | data | latent prior | shape prior | kinetic prior | interaction term
Stage 1 | 1.0 | 4040.0 | 100.0 | 1000.0 | 0.0
Stage 2 | 1.0 | 404.0 | 50.0 | 500.0 | 0.0
Stage 3 | 1.0 | 57.4 | 10.0 | 250.0 | 0.0
Stage 4 | 1.0 | 1.78 | 5.0 | 200.0 | 4500.0

A.3 Training details
We introduce the training details of our distribution prior in this section. As mentioned in the main paper, the training sets of Human3.6M and GPA are used for training. We first apply CMA-ES to obtain pseudo ground truth. Since generating sampled target poses for a complete motion sequence is difficult and time-consuming, we select two consecutive frames from the dataset and compute the character state by linear interpolation. The kinematic pose of the second frame is used as the reference. We then apply CMA-ES to obtain the target pose distribution. Once the prior converges, we finish the pre-training procedure and incorporate the two-branch decoder to refine the network. Ideally, the character state would be identical to the state obtained by linear interpolation; in the training phase, we therefore add random noise to the character state to simulate the discrepancy observed in real simulation. The distribution prior is trained on a single NVIDIA TITAN RTX GPU with a learning rate of 0.0001 and a batch size of 32.
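The construction of one training state can be sketched as follows. The function name and the noise level are illustrative assumptions, not values from our experiments; the velocity follows by finite differences over the frame interval.

```python
import numpy as np

def make_training_state(pose_a, pose_b, alpha, dt, noise_std=0.01, rng=None):
    """Build a character state between two consecutive kinematic frames.
    The pose is linearly interpolated and the velocity is a finite
    difference; Gaussian noise mimics the discrepancy of real simulation."""
    rng = np.random.default_rng() if rng is None else rng
    pose = (1 - alpha) * pose_a + alpha * pose_b  # linear interpolation
    vel = (pose_b - pose_a) / dt                  # finite-difference velocity
    state = np.concatenate([pose, vel])
    return state + noise_std * rng.standard_normal(state.shape)
```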
A.4 Sampling details
We describe the implementation details of the neural motion control. The character state consists of the pose and the velocity. The first 6 dimensions of the pose are the global translation and global rotation; the rest are joint rotations represented as axis-angles. The velocity contains the 3-dimensional base linear velocity, the 3-dimensional base angular velocity, and the 51-dimensional joint angular velocities. The total dimension of the character state is 114. In the physics simulator, rotations are represented by quaternions.
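For illustration, the state layout above can be written as a small packing helper (the function and constant names are ours):

```python
import numpy as np

# State layout: pose = 3 (translation) + 3 (root rotation) + 51 (joint
# axis-angles) = 57, velocity = 3 + 3 + 51 = 57, total 114 dimensions.
ROOT_TRANS, ROOT_ROT, JOINT_DOF = 3, 3, 51

def pack_state(trans, root_rot, joint_rot, lin_vel, ang_vel, joint_vel):
    """Concatenate pose (57-D) and velocity (57-D) into the 114-D state."""
    pose = np.concatenate([trans, root_rot, joint_rot])
    vel = np.concatenate([lin_vel, ang_vel, joint_vel])
    assert pose.size == ROOT_TRANS + ROOT_ROT + JOINT_DOF  # 57
    assert vel.size == ROOT_TRANS + ROOT_ROT + JOINT_DOF   # 57
    return np.concatenate([pose, vel])                     # 114
```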
We do not directly control the root joint, so the target pose has the same dimension as the DOF (degrees of freedom) of the movable joints, which is 51. Given a target pose, we first compute torques with the PD controller and limit them to a reasonable range; the parameters are given in Tab. 4. The torques are then applied to the character in torque control mode. For each target pose, we simulate 8 substeps and recompute the torques from the target pose and the simulated pose at each substep. When applying the distribution prior in neural motion control, we use the first frame of the reference motion to initialize the physical character.
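A minimal sketch of this control loop, assuming a generic per-joint formulation; `sim_step` is a placeholder for one step of the physics simulator, not the PyBullet API:

```python
import numpy as np

def pd_torques(q_target, q, qdot, kp, kd, limit):
    """PD tracking torque, clamped to the per-joint force limit (cf. Tab. 4)."""
    tau = kp * (q_target - q) - kd * qdot
    return np.clip(tau, -limit, limit)

def apply_target(sim_step, q_target, state, kp, kd, limit, substeps=8):
    """Simulate 8 substeps per target pose, recomputing the torques from
    the current simulated pose before every substep."""
    for _ in range(substeps):
        q, qdot = state
        state = sim_step(pd_torques(q_target, q, qdot, kp, kd, limit), state)
    return state
```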
A.5 Optimization details
Our optimization has four stages. The only difference between the stages is the loss weight of each term; the changing weights drive the optimized results from coarse to fine. As shown in Tab. 7, the weights of the first three stages yield approximate results. We apply the interaction constraint only in the last stage to obtain accurate ground contact.
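The staged weighting can be expressed as a simple configuration. The weight values are those of Tab. 7; the term names are our shorthand for the loss terms of the main paper.

```python
# Loss weights for the coarse-to-fine optimization stages (Tab. 7).
STAGES = [
    dict(data=1.0, latent=4040.0, shape=100.0, kinetic=1000.0, interaction=0.0),
    dict(data=1.0, latent=404.0,  shape=50.0,  kinetic=500.0,  interaction=0.0),
    dict(data=1.0, latent=57.4,   shape=10.0,  kinetic=250.0,  interaction=0.0),
    dict(data=1.0, latent=1.78,   shape=5.0,   kinetic=200.0,  interaction=4500.0),
]

def stage_loss(terms, weights):
    """Weighted sum of the individual loss terms for one stage."""
    return sum(weights[name] * value for name, value in terms.items())
```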
A.6 Details on SDF
Our SDF representation is similar to [hassan2019resolving], where the scene is used to constrain a single human pose. We use a uniform voxel grid of size 256 × 256 × 256 to represent the field, and query the discretized 3D distance field with trilinear interpolation to compensate for the limited grid resolution. This resolution is sufficient to obtain coarse contact for physics-based motion capture with reasonable computational complexity and memory consumption.
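A minimal sketch of the trilinear SDF query, assuming an axis-aligned voxel grid with a given origin and voxel spacing (the function name is ours):

```python
import numpy as np

def sdf_trilinear(grid, point, origin, spacing):
    """Query a voxel-grid SDF at a continuous point via trilinear interpolation."""
    u = (np.asarray(point, dtype=float) - origin) / spacing  # voxel coordinates
    i = np.clip(np.floor(u).astype(int), 0, np.array(grid.shape) - 2)
    f = u - i                                                # fractional offsets
    val = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((f[0] if dx else 1 - f[0]) *
                     (f[1] if dy else 1 - f[1]) *
                     (f[2] if dz else 1 - f[2]))
                val += w * grid[i[0] + dx, i[1] + dy, i[2] + dz]
    return val
```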
Appendix B Success rate
Since our method is based on sampling, a single run of the sampling algorithm is not guaranteed to reconstruct a successful control [liu2010sampling]: the character may fall and be unable to finish the complete motion. The success rate is therefore an important metric for evaluating our method. For sampling-based motion control, the sampling distribution is the most important factor influencing the success rate. The CMA-ES-based method [liu2015improving] learns from previous trials to update the distribution via online adaptation and may require more trials for motion capture tasks. Its first several trials draw samples randomly and blindly and are very likely to fail, which results in a low overall success rate; this limitation is also demonstrated in a recent work [xie2021inverse]. With the well-trained prior, our method samples from the regressed distribution and achieves a high success rate across all trials. In Tab. 8, we follow [xie2021inverse] and [liu2010sampling] and conduct a comparison on the leg-lifting motion. The success rate is 97% for our method and 90% for [liu2015improving], while [liu2010sampling] achieves 83%. The comparison illustrates the advantage of our approach from a different perspective. In addition, increasing the number of samples and of saved samples at each iteration can improve the success rate. Furthermore, when tracking fails, we can run the method multiple times on the same problem, allowing a user to explore different possible reconstructions.


Method | Liu et al. [liu2010sampling] | Liu et al. [liu2015improving] | Neural MoCon
Success rate | 83% | 90% | 97%

Appendix C More results and discussions
We show more qualitative results in this section to demonstrate the performance of our method. In Fig. 9, the results on the GPA, 3DOH, and Human3.6M datasets show that our method is robust to complex terrains, occlusions, and body shape variations. Furthermore, we apply the pose estimated by the neural motion control to the SMPL model to obtain the skinned mesh. The resulting meshes are natural and accurate. Since collision detection is conducted on the primitives of the physical character, shape discrepancies at the hands and feet cause slight interpenetration of the skinned mesh. In future work, we will detect mesh-level collisions or design more delicate characters to prevent these artifacts.
Appendix D Video
In the video, we show qualitative comparisons with VIBE, DMMR, and PhysCap. Owing to the hard physical constraints, our method prevents floor interpenetration, and most foot sliding is avoided by the lateral friction. To demonstrate the performance of our method on complex terrain, we compare with PhysCap on the GPA dataset using the original character model of PhysCap. Although PhysCap obtains smooth motion, its wrong contact states in the uneven-terrain scenario result in floating motion. In contrast, since the interaction constraint is used in the kinematic optimization, our method produces physically plausible, high-quality motion with the proposed neural motion control.
Appendix E Why neural motion control?
Building a 3D human dataset with accurate force annotations is complex and expensive [zell2020weakly]. Joint torques cannot be measured non-intrusively and therefore need to be derived with computationally expensive optimization techniques. Furthermore, the torques of different subjects with different body shapes vary greatly, which leads to poor generalization of a network that directly regresses joint torques. The distribution prior, in contrast, is a network that regresses the target pose distribution. With dense supervision from the pseudo ground truth and the two-branch decoder, the network converges easily. In addition, thanks to the sampling, the neural motion control generalizes better to complex terrain, body shape variations, and diverse behaviors.
Compared with CMA-ES-based methods, existing sampling-based motion control first relies on CMA-ES to adapt the distribution by evaluating plenty of samples, which is time-consuming. Our network-based prior avoids such distribution adaptation, and elite samples can be obtained directly from the regressed distribution; our method thus saves many sample evaluations compared to [liu2015improving]. Furthermore, the distribution adaptation relies on random samples from an initial distribution to update the distribution via CMA-ES, which introduces uncertainty into the motion capture. The proposed prior avoids this uncertainty, and precise control can be acquired by sampling from the distribution output by the network, just as in CMA-ES-based approaches. Combining neural networks with sampling-based motion control provides a feasible path toward real-time physics-based motion capture, though a gap to this goal remains.