We are More than Our Joints: Predicting how 3D Bodies Move

by Yan Zhang, et al.

A key step towards understanding human behavior is the prediction of 3D human motion. Successful solutions have many applications in human tracking, HCI, and graphics. Most previous work focuses on predicting a time series of future 3D joint locations given a sequence of 3D joints from the past. This Euclidean formulation generally works better than predicting pose in terms of joint rotations. Body joint locations, however, do not fully constrain 3D human pose, leaving degrees of freedom undefined and making it hard to animate a realistic human from only the joints. Note that the 3D joints can be viewed as a sparse point cloud; thus the problem of human motion prediction can be seen as point cloud prediction. With this observation, we instead predict a sparse set of locations on the body surface that correspond to motion capture markers. Given such markers, we fit a parametric body model to recover the 3D shape and pose of the person. These sparse surface markers also carry detailed information about human movement that is not present in the joints, increasing the naturalness of the predicted motions. Using the AMASS dataset, we train MOJO, a novel variational autoencoder that generates motions from latent frequencies. MOJO preserves the full temporal resolution of the input motion, and sampling from the latent frequencies explicitly introduces high-frequency components into the generated motion. We note that motion prediction methods accumulate errors over time, resulting in joints or markers that diverge from true human bodies. To address this, we fit SMPL-X to the predictions at each time step, projecting the solution back onto the space of valid bodies; the resulting valid markers are then propagated in time. Experiments show that our method produces state-of-the-art results and realistic 3D body animations. The code for research purposes is at https://yz-cnsdqz.github.io/MOJO/MOJO.html




1 Introduction

Human motion prediction has been extensively studied as a way to understand and model human behavior. Given the recent past motion of a person, the goal is to predict either a deterministic future motion or a diverse set of plausible future motions. While useful for animation, AR, and VR, predicting human movement is much more valuable because it means we have a model of how people move. Such a model is useful for applications in sports [57], pedestrian tracking [45], smart user interfaces [52], robotics [31], and more. While this is a variant of the well-studied time-series prediction problem, most existing methods are still not able to produce realistic 3D body motion.

To address the gap in realism, we make several novel contributions, but start with a few observations. First, most existing methods for 3D motion prediction treat the body as a skeleton and predict a small set of 3D joints. While some methods represent the skeleton in terms of joint angles, the most accurate methods simply predict the 3D joint locations in Euclidean space. Second, given a sparse set of joint locations, animating a full 3D body is ambiguous because important degrees of freedom are not modeled, e.g. rotation about limb axes. Third, most papers show qualitative results by rendering the skeletons, and these typically look fine to the human eye. We will show, however, that as time progresses the skeletons can become less and less human in proportion, so that by the end of the sequence the skeleton rarely corresponds to a valid human body. Fourth, the joints of the body cannot capture the nuanced details of how the surface of the body moves, limiting the realism of any resulting animation.

Given that a key goal of these methods is to animate human bodies, our observations are troubling. By addressing them, we provide a solution, called MOJO (More than Our JOints), for realistic 3D body motion prediction that coherently incorporates a novel representation of the body in motion, a novel motion generative network, and a novel scheme for 3D body mesh recovery.

First, the set of 3D joints predicted by existing methods can be viewed as a sparse point cloud. In this light, existing human motion prediction methods perform point cloud prediction. Thus, we are free to choose a different point cloud that better satisfies the ultimate goal of animating bodies. Specifically, we model the body with a sparse set of surface markers corresponding to those used in motion capture (mocap) systems. We simply swap one type of sparse point cloud for another, but, as we will show, predicting surface markers has key advantages. For example, there exist methods to fit a SMPL body model [33] to such markers, producing realistic animations [32, 34]. Consequently, this shift to predicting markers enables us to (1) leverage a powerful statistical body shape model to improve results, (2) obtain realistic animations immediately, and (3) produce an output representation that can be used in many applications.


Second, to model fine-grained and high-frequency interactions between markers, we design a conditional variational autoencoder (CVAE) with a latent cosine space. It not only performs stochastic motion prediction, but also improves motion realism by incorporating high-frequency motion details. Compared to most existing methods that encode motion with a single vector (e.g. the last hidden state of an RNN), our model preserves the full temporal resolution of the sequence, and decomposes motion into independent frequency bands in the latent space via a discrete cosine transform (DCT). Based on the energy compaction property of the DCT [3, 43] (most information is concentrated in the low-frequency bands), we train our CVAE with a robust Kullback–Leibler divergence (KLD) term [58], creating an implicit latent prior that carries most of the information in the low-frequency bands. To sample from this implicit latent prior, we employ diversifying latent flows (DLow) [55] in the low-frequency bands to produce informative features, and sample from the standard normal distribution in the high-frequency bands to produce white noise. Information from the various frequency bands is then fused to compose the output motion.
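The energy compaction property that motivates this design can be seen with a minimal numpy sketch (our own illustration, not the paper's code): for a smooth, motion-like toy signal, most of the DCT energy lands in the low-frequency bands, and the orthonormal inverse transform restores the signal exactly.

```python
import numpy as np

def dct_ortho(x):
    # Orthonormal DCT-II of a 1-D signal.
    N = len(x)
    n = np.arange(N)
    basis = np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / N)  # (k, n)
    scale = np.full(N, np.sqrt(2.0 / N)); scale[0] = np.sqrt(1.0 / N)
    return scale * (basis @ x)

def idct_ortho(c):
    # Inverse of the orthonormal DCT-II (transpose of the basis matrix).
    N = len(c)
    n = np.arange(N)
    basis = np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / N)
    scale = np.full(N, np.sqrt(2.0 / N)); scale[0] = np.sqrt(1.0 / N)
    return basis.T @ (scale * c)

t = np.linspace(0.0, 3.0, 45)                    # 45 frames = 3 s at 15 fps
signal = np.sin(2 * np.pi * 0.8 * t) + 0.3 * t   # smooth toy trajectory
coeffs = dct_ortho(signal)
energy = coeffs ** 2
low_band_share = energy[:9].sum() / energy.sum() # first 20% of the bands
```

For this toy trajectory the first 20% of bands hold the vast majority of the energy, which is why informative sampling is concentrated there.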

Third, in the testing phase, we propose a recursive projection scheme supported by our marker-based representation, in order to retain natural body shape and pose throughout the sequence. We regard the valid body space as a low-dimensional manifold in the Euclidean space of markers. When the CVAE decoder makes a prediction step, the predicted markers tend to leave this manifold because of error accumulation. Therefore, after each step we project the predicted markers back to the valid body manifold, by fitting an expressive SMPL-X [39] body mesh to the markers. On the fitted body, we know the true marker locations and pass these to the next stage of the RNN, effectively denoising the markers at each time instant. Besides keeping the solution valid, the recursive projection scheme directly yields body model parameters and hence realistic body meshes.

We exploit the AMASS [34] dataset for evaluation, as well as Human3.6M [25] and HumanEva-I [46], to compare our method with the state of the art in stochastic motion prediction. We show that our models with the latent DCT space outperform the state of the art, and that the recursive projection scheme effectively eliminates unrealistic body deformation. We also evaluate the realism of the generated motion with a foot skating measure and a perceptual user study. Finally, we compare our solution with a traditional pipeline, which first predicts 3D joints and then fits a body to the joints. We find that the two are comparable w.r.t. prediction diversity and accuracy, but the traditional pipeline risks failure due to bone deformation.

Contributions. In summary, our contributions are: (1) We propose a marker-based representation for bodies in motion, which provides more constraints than the body skeleton and hence benefits 3D body recovery. (2) We design a new CVAE with a latent DCT space to improve motion modelling. (3) We propose a recursive projection scheme to keep the body valid during motion. Combining all three aspects, we exploit our bodies more than our joints; hence we name our solution MOJO.

2 Related Work

Deterministic human motion prediction. Given an input human motion sequence, the goal is to forecast a deterministic future motion, which is expected to be close to the ground truth. This task has been extensively studied [5, 6, 10, 13, 15, 16, 17, 18, 28, 29, 35, 36, 40, 41, 47, 49, 60]. Martinez et al. [36] propose an RNN with residual connections linking the input and the output, and design a sampling-based loss to compensate for prediction errors during training. Cai et al. [10] and Mao et al. [35] use the discrete cosine transform (DCT) to convert the motion into the frequency domain. Mao et al. [35] then employ graph convolutions to process the frequency components, whereas Cai et al. [10] use a transformer-based architecture.

Stochastic 3D human motion prediction. In contrast to deterministic motion prediction, stochastic motion prediction produces diverse plausible future motions, given a single motion from the past [7, 8, 14, 19, 30, 48, 53, 55, 58]. Yan et al. [53] propose a motion transformation VAE to jointly learn the motion mode feature and the transition between motion modes. Barsoum et al. [7] propose a probabilistic sequence-to-sequence model, which is trained with a Wasserstein generative adversarial network. Bhattacharyya et al. [8] design a ‘best-of-many’ sampling objective to boost the performance of conditional VAEs. Gurumurthy et al. [19] propose a GAN-based network and parameterize the latent generative space as a mixture model. Yuan et al. [55] propose diversifying latent flows (DLow) to exploit the latent space of an RNN-based VAE, which is able to generate highly diverse yet accurate future motions.

Frequency-based motion analysis. Earlier studies like [38, 42] adopt a Fourier transform for motion synthesis and tracking. Akhter et al. [4] propose a linear basis model for spatiotemporal motion regularization, and discover that the optimal PCA basis of a large set of facial motions converges to the DCT basis. Huang et al. [24] employ low-frequency DCT bands to regularize the motion of body meshes recovered from 2D body joints and silhouettes. The studies of [10, 35, 49] use deep neural networks to process DCT frequency components for motion prediction. Yumer et al. [56] and Holden et al. [23] handle human motions in the Fourier domain to conduct motion style transfer.

Representing human bodies in motion. 3D joint locations are widely used, e.g. [29, 36, 55]. To improve prediction accuracy, Mao et al. [35], Cui et al. [13], Li et al. [29] and others use a graph to capture interactions between joints. Aksan et al. [6] propose a structured network layer to represent the body joints according to a kinematic tree. Despite their effectiveness, these methods allow the skeletal bone lengths to vary during motion. To alleviate this issue, Hernandez et al. [22] use a training loss to penalize bone length variations, and Gui et al. [17] design a geodesic loss and two discriminators to keep the predicted motion human-like over time. To remove the influence of body shape, Pavllo et al. [40, 41] use quaternion-based joint rotations to represent the body pose; when predicting global motion, a walking path is first produced and the pose sequence is then generated. Zhang et al. [58] represent the body in motion by the 3D global translation and the joint rotations following the SMPL kinematic tree [33], applying a constant body shape when animating a body mesh during testing. Although a constant body shape is preserved, foot skating frequently occurs due to the inconsistent relation between the body pose, the body shape, and the global translation.

MOJO in context. Our solution not only improves stochastic motion prediction over the state of the art, but also directly produces diverse future motions of realistic 3D bodies. Specifically, our latent DCT space represents motion with different frequencies, rather than with a single vector in the latent space. We find that generating motions from different frequency bands significantly improves diversity while retaining accuracy. Additionally, compared to previous skeleton-based body representations, we propose to use markers on the body surface, which provide more constraints on the body shape and its degrees of freedom. This marker-based representation enables us to design an efficient recursive projection scheme that fits SMPL-X [39] at each prediction step. In contrast to previous methods that rely on special losses or discriminators, our solution works during the testing phase, and hence does not suffer from the train-test domain gap. More recently, Weng et al. [50, 51] propose to first forecast future LiDAR point clouds and then detect-and-track 3D objects from the predicted point clouds. Despite the similar idea, they focus on autonomous driving and do not address articulated, complex human movements.

3 Method

3.1 Preliminaries

SMPL-X body mesh model [39]. Given a compact set of body parameters, SMPL-X produces a realistic body mesh including face and hand details. In our work, the body parameter set includes the global translation, the global orientation in the continuous 6D rotation representation [59], the body shape, the body pose in the VPoser latent space [39], and the hand pose in the MANO [44] PCA space. The resulting SMPL-X mesh has a fixed topology with 10,475 vertices.

Diversifying latent flows (DLow) [55]. The entire DLow method comprises a CVAE to predict future motions, and a network Q that takes the condition sequence as input and transforms a sample to diverse places in the latent space. To train the CVAE, the loss consists of a frame-wise reconstruction term, a term to penalize the difference between the last input frame and the first predicted frame, and a KLD term. Training the network Q requires a pre-trained CVAE decoder, and its training loss encourages one generated motion to reconstruct the ground truth, and also encourages the diversity of all generated motions.

3.2 Human Motion Representation

Most existing methods use 3D joint locations or rotations to represent the body in motion. This results in ambiguities in recovering the full shape and pose [9, 39]. To obtain more constraints on the body, while preserving efficiency, we represent bodies in motion with markers on the body surface. Inspired by modern mocap systems, we follow the marker placement of the CMU mocap dataset [1], and select 41 vertices on the SMPL-X body mesh, which are illustrated in Fig. 2.

In addition, we use 3D locations in Euclidean space to describe these markers. Compared to representing the body with the global translation and the local pose, as in [41, 58], such a location-based representation naturally couples the global body translation and the local pose variation, and hence is less prone to motion artifacts like foot skating, which are caused by a mismatch between the global movement, the pose variation, and the body shape.

Therefore, in each frame the body is represented by a 123-D feature vector, i.e. the concatenation of the 3D locations of the 41 markers, and the motion is represented by a time sequence of such vectors. We denote the input sequence to the model as X, and a predicted future sequence from the model as Y. With fitting algorithms like MoSh and MoSh++ [32, 34], it is much easier to recover 3D bodies from markers than from joints.
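A small sketch of this representation (toy data, our own illustration): each frame stacks the 3D locations of the 41 surface markers into one 123-D vector, and a motion is a sequence of such vectors.

```python
import numpy as np

# Random marker trajectories stand in for real mocap data.
rng = np.random.default_rng(0)
num_frames, num_markers = 60, 41
markers = rng.standard_normal((num_frames, num_markers, 3))   # (T, 41, 3)

# One 123-D feature vector per frame: concatenated marker locations.
frames = markers.reshape(num_frames, num_markers * 3)         # (T, 123)
```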

Figure 2: Illustration of our marker setting. The markers are denoted by 3D spheres attached to the SMPL-X body surface. From left to right: the front view, the back view and the top view.

3.3 Motion Generator with Latent Frequencies

Figure 3: Illustration of our CVAE architecture. The red arrows denote sampling from the inference posterior. The circles with ‘c’ and ‘+’ denote feature concatenation and addition, respectively. The blocks with ‘fc’ denote a stack of fully-connected layers.

For a real human body, the motion granularity usually corresponds to the motion frequency because of the underlying locomotor system. For example, the frequency of waving hands is usually much higher than the frequency of a jogging gait. Therefore, we design a network with multiple frequency bands in the latent space, so as to better represent interactions between markers on the body surface and to model motions at different granularity levels.

Architectures. Our model, visualized in Fig. 3, is designed according to the CVAE in the DLow method [55]. The encoder with a GRU [12] preserves the full temporal resolution of the input. The motion information is then decomposed onto multiple frequency bands via the DCT. At individual frequency bands, we use the re-parameterization trick [26] to introduce randomness, and then use the inverse DCT to convert the motion back to the temporal domain. To improve temporal smoothness and eliminate the first-frame jump reported in [36], we use residual connections at the output. We note that the CVAE in the DLow method, which does not have residual connections but has a loss to penalize the jump artifact, is not effective at producing smooth and realistic motions in terms of markers. A visualization of this latent DCT space is shown in the appendix.

Training with robust Kullback-Leibler divergence. Our training loss comprises three terms for frame-wise reconstruction, frame-wise velocity reconstruction, and latent distribution regularization, respectively. The formula is given by

$\mathcal{L} = \|\hat{Y} - Y\|^2 + \|\Delta \hat{Y} - \Delta Y\|^2 + \lambda\, \Psi\big(\mathrm{KL}\left(q(z \mid X, Y)\, \|\, \mathcal{N}(0, I)\right)\big),$

in which $X$ is the input sequence, $Y$ the ground-truth future sequence, $\hat{Y}$ the prediction, $\Delta$ computes the time difference, $q(z \mid X, Y)$ denotes the inference posterior (the encoder), $z$ denotes the latent frequency components, $\Psi$ denotes the robust penalty of [11], and $\lambda$ is a loss weight. We find that the velocity reconstruction term further improves temporal smoothness.

Notably, our distribution regularization term is given by the robust KLD [58] with the Charbonnier penalty $\Psi(s) = \sqrt{s^2 + 1} - 1$ [11], which defines an implicit latent prior different from the standard normal distribution. During optimization, the gradient of the entire KLD term becomes smaller as the divergence from $\mathcal{N}(0, I)$ becomes smaller. Thus, it allows the inference posterior to carry information, and prevents posterior collapse. More importantly, this term is highly suitable for our latent DCT space. According to the energy compaction property of the DCT [3, 43], we expect the latent prior to deviate from $\mathcal{N}(0, I)$ in the low-frequency bands to carry information, but to equal $\mathcal{N}(0, I)$ in the high-frequency bands to produce white noise. We let the robust KLD term automatically determine which frequency bands carry information. A detailed analysis of the robust KLD term is in the appendix.
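A hedged numpy sketch of this regularizer (function names are ours, not the paper's): the diagonal-Gaussian KL divergence from N(0, I) is passed through the Charbonnier-style penalty psi(s) = sqrt(s^2 + 1) - 1, which damps large divergences and vanishes only when the posterior exactly matches the prior.

```python
import numpy as np

def gaussian_kld(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def robust_kld(mu, logvar):
    # Charbonnier penalty on the KLD: gradient shrinks as divergence shrinks.
    s = gaussian_kld(mu, logvar)
    return np.sqrt(s**2 + 1.0) - 1.0

mu = np.ones((1, 8))
logvar = np.zeros((1, 8))
plain = gaussian_kld(mu, logvar)    # 0.5 per dim * 8 dims = 4.0 here
robust = robust_kld(mu, logvar)     # smaller than the plain KLD
```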

Sampling from the latent DCT space. Since our latent prior is implicit, sampling from the latent space is not as straightforward as sampling from the standard normal distribution, as in most VAEs. Due to the nature of the DCT, we know that motion information is concentrated in the low-frequency bands, and hence we directly explore these informative low-frequency bands using the network Q in DLow [55].

Specifically, we use DLow networks to sample from the K lowest frequency bands, and sample from the standard normal distribution for the remaining higher frequency bands. Since the cosine basis is orthogonal and individual frequency bands carry independent information, these DLow models do not share parameters; they are jointly trained with the same losses as in [55], together with the decoder of our CVAE with the latent DCT space. Our sampling approach is illustrated in Fig. 4. The influence of the threshold K is investigated in Sec. 4 and in the appendix.
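An illustrative sketch of the band-wise sampling (all names are placeholders): the lowest latent frequency bands receive informative samples, with simple affine maps standing in for the per-band DLow networks, while the remaining bands receive white noise from N(0, I).

```python
import numpy as np

def sample_latent(rng, band_maps, num_bands, latent_dim):
    K = len(band_maps)
    z = np.empty((num_bands, latent_dim))
    eps = rng.standard_normal(latent_dim)
    for k, (A, b) in enumerate(band_maps):
        z[k] = A @ eps + b                                    # informative low band
    z[K:] = rng.standard_normal((num_bands - K, latent_dim))  # white noise
    return z

rng = np.random.default_rng(1)
latent_dim, num_bands, K = 16, 45, 9
# Toy affine maps: identity with a per-band offset, in lieu of learned DLow nets.
band_maps = [(np.eye(latent_dim), np.full(latent_dim, float(k))) for k in range(K)]
z = sample_latent(rng, band_maps, num_bands, latent_dim)
```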

Figure 4: Illustration of our sampling scheme to generate different sequences. $Q_k$ denotes the network that produces the latent transformation at frequency band $k$. The red arrows denote the sampling operation.

3.4 Recursive Projection to the Valid Body Space

Figure 5: Illustration of our prediction scheme with projection. The notation has the same meaning as before.

Our generative model produces diverse motions in terms of marker location variations. Due to RNN error accumulation, the predicted markers can gradually deviate from a valid 3D body, resulting in e.g. flattened heads and twisted torsos. Existing methods with new losses or discriminators [22, 17, 27] can alleviate this problem, but may fail unpredictably due to the train-test domain gap.

Instead, we exploit the fact that valid bodies lie on a low-dimensional manifold in the Euclidean space of markers. Whenever the RNN performs a prediction step, the solution tends to leave this manifold. Therefore, at each prediction step, we project the predicted markers back to that manifold by fitting a SMPL-X body mesh to them. Since markers provide rich body constraints, this fitting process can be performed efficiently. Our recursive projection scheme is illustrated in Fig. 5.
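The predict-project loop can be sketched schematically with placeholder functions (decoder_step and fit_body stand in for the CVAE decoder step and the SMPL-X fitting; they are not the paper's implementation): after every prediction step, the markers are replaced by the markers of the fitted, valid body before being fed back.

```python
def predict_with_projection(decoder_step, fit_body, markers, state, num_steps):
    outputs = []
    for _ in range(num_steps):
        raw, state = decoder_step(markers, state)  # may drift off the manifold
        body = fit_body(raw)                       # project onto valid bodies
        markers = body["markers"]                  # read markers off the fitted mesh
        outputs.append(body)
    return outputs

# Toy demo with 1-D "markers": the decoder drifts by +1.7 per step and the
# fitter snaps the value back to the nearest-lower integer ("the manifold").
demo = predict_with_projection(
    lambda m, s: (m + 1.7, s),
    lambda m: {"markers": float(int(m))},
    0.0, None, 3)
```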

Following MoSh [32] and MoSh++ [34], the fitting is optimization-based and consists of three stages: (1) optimizing the global translation and orientation, (2) additionally optimizing the body pose, and (3) additionally optimizing the hand pose. At each time $t$, we use the previous fitted result to initialize the current optimization, so that the optimum can be reached in a small number of iterations. The loss of our optimization-based fitting at time $t$ is given by

$\mathcal{L}_{\text{fit}}(t) = \lambda_m \big\| \mathcal{M}(\Theta_t) - \hat{Y}_t \big\|^2 + \sum_i \lambda_i \mathcal{R}_i(\Theta_t),$

in which the $\lambda$'s are loss weights, $\mathcal{M}(\Theta_t)$ denotes the corresponding 41 markers on the SMPL-X body mesh with parameters $\Theta_t$, $\hat{Y}_t$ denotes the markers predicted by the CVAE decoder, and the $\mathcal{R}_i$ denote regularization terms on the body parameters. From our recursive projection scheme, we not only obtain regularized markers, but also realistic 3D bodies along with their characteristic parameters.

4 Experiments (please see more details in the appendix)

4.1 Datasets

We exploit the AMASS [34] dataset. Specifically, we train the models on the combination of CMU [1] and MPI HDM05 [37], and test them on ACCAD [2] and BMLhandball [21, 20]. The CMU dataset contains 551.56 minutes of motion from 106 subjects, and the MPI HDM05 dataset contains 144.54 minutes of motion from 4 subjects. The ACCAD dataset contains 26.74 minutes of motion from 20 subjects, including walking, running, kicking, and other coarse-grained actions. The BMLhandball dataset contains 101.98 minutes of handball playing from 10 subjects, in which most motions are fine-grained, such as waving and picking up objects. To unify the sequence length and the world coordinates, we canonicalize AMASS as follows. First, we trim the original sequences into 480-frame (4-second) subsequences and downsample them from 120 fps to 15 fps; the condition sequence contains 15 frames (1 s) and the future sequence contains 45 frames (3 s). Second, we unify the world coordinates as in [58]. For each subsequence, we reset the world coordinate frame to the SMPL-X [39] body mesh in the first frame: the X-axis is the horizontal component of the direction from the left hip to the right hip, the Z-axis is the negative direction of gravity, and the Y-axis points forward. The origin is set to the body's global translation.
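The coordinate canonicalization above can be sketched as follows (a hedged illustration with toy joint positions; the function name is ours): the X-axis is the horizontal component of the left-hip-to-right-hip direction, the Z-axis opposes gravity, and Y = Z × X points forward.

```python
import numpy as np

def canonical_frame(left_hip, right_hip):
    x = right_hip - left_hip
    x[2] = 0.0                          # keep only the horizontal component
    x = x / np.linalg.norm(x)
    z = np.array([0.0, 0.0, 1.0])       # negative gravity direction
    y = np.cross(z, x)                  # forward direction
    return np.stack([x, y, z])          # rows form a rotation matrix

# Hips symmetric about the origin, at hip height 0.9 m: the canonical
# frame coincides with the world axes.
R = canonical_frame(np.array([-0.1, 0.05, 0.9]), np.array([0.1, 0.05, 0.9]))
```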

To fairly compare our method with state-of-the-art stochastic motion prediction methods, we additionally perform skeleton-based motion prediction on the Human3.6M [25] and HumanEva-I [46] datasets, following the experimental setting of Yuan et al. [55]. On Human3.6M we use the 17-joint skeleton, train on subjects S1, S5, S6, S7, and S8, and test on S9 and S11. The condition sequence has 25 frames (0.5 s), and the predicted sequence has 100 frames (2 s). On HumanEva-I we use a 15-joint skeleton; the condition sequence has 15 frames (0.25 s), and the predicted sequence has 60 frames (1 s).

4.2 Baselines

For a fair comparison with the state of the art on Human3.6M and HumanEva-I, we use the same setting as the CVAE in DLow [55], except that we add our latent DCT space to the model and train it with the robust KLD term. For sampling, we follow the procedure described in Sec. 3.3, applying a set of DLow models to the K lowest frequency bands. We denote this modified model as ‘VAE+DCT+K’, which is used to verify the effectiveness of the latent DCT space. Absence of the suffix ‘+K’ indicates sampling from the standard normal distribution at all frequency bands.

Our MOJO solution works on markers. Since it has several components, we denote ‘MOJO-DCT’ as the model without DCT, but with the same latent space as the CVAE in DLow. We also denote ‘MOJO-proj’ as the model without the recursive projection scheme. Note that these suffixes can be concatenated, so that e.g. ‘MOJO-DCT-proj’ denotes the model without the latent DCT space and without the recursive projection scheme. Moreover, we use ‘MOJO w. J.’ to denote a traditional pipeline, which first predicts joints and then fits the body mesh to the predicted joints; its generative model architecture is identical to MOJO.

4.3 Evaluation on Stochastic Motion Prediction

4.3.1 Metrics

Diversity. We use the same diversity measure as [55], i.e., the average pairwise distance between all generated sequences.

Prediction accuracy. As in [55], we use the average distance error (ADE) and the final distance error (FDE) to measure the minimum distance between the generated motions and the ground truth, w.r.t. the frame-wise difference and the final-frame difference, respectively. Additionally, we use MMADE and MMFDE to evaluate prediction accuracy when the input sequence slightly changes.
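A sketch of the diversity and accuracy metrics as we read them (toy arrays; our own implementation, not the official evaluation code): diversity is the average pairwise distance between generated sequences, and ADE/FDE take, over the generated samples, the minimum average (or final-frame) distance to the ground truth.

```python
import numpy as np

def diversity(samples):                     # samples: (S, T, D)
    flat = samples.reshape(samples.shape[0], -1)
    dists = [np.linalg.norm(flat[i] - flat[j])
             for i in range(len(flat)) for j in range(i + 1, len(flat))]
    return float(np.mean(dists))

def ade(samples, gt):                       # gt: (T, D)
    per_frame = np.linalg.norm(samples - gt[None], axis=-1)  # (S, T)
    return float(per_frame.mean(axis=-1).min())              # best sample

def fde(samples, gt):
    return float(np.linalg.norm(samples[:, -1] - gt[-1], axis=-1).min())

gt = np.zeros((45, 123))                    # 45 frames of 123-D features
samples = np.stack([gt + 0.5, gt + 2.0])    # one close and one far sample
```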

Motion frequency. Similar to [22], we compute the frequency spectra entropy (FSE) to measure motion frequency in the Fourier domain, given by the averaged spectral entropy minus that of the ground truth. A higher value therefore indicates that the generated motions contain more motion details. Note that high frequency can also indicate noise, and hence this metric is considered jointly with the prediction accuracy.
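The spectral-entropy idea behind FSE can be sketched on a single 1-D signal (a hedged toy version reflecting our reading of the metric, not the paper's exact implementation): the entropy of the normalized power spectrum grows as the energy spreads over more frequency bins.

```python
import numpy as np

def spectral_entropy(x):
    power = np.abs(np.fft.rfft(x)) ** 2
    p = power / power.sum()
    p = p[p > 1e-12]                 # drop numerically-zero bins
    return float(-(p * np.log(p)).sum())

n = np.arange(16)
one_tone = np.sin(2 * np.pi * n / 16)                  # energy in one band
two_tones = one_tone + np.sin(2 * np.pi * 3 * n / 16)  # two equal bands
```

Two equal-power tones give entropy log 2, versus roughly zero for a single tone, so richer frequency content scores higher.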

4.3.2 Results

In the following experiments, we generate 50 different future sequences based on each input sequence, as in [55].

Skeleton-based motion prediction. The results are shown in Tab. 1. Our latent DCT space improves on the state of the art: the diversity is improved by a large margin, while the prediction accuracies are comparable to or better than the baseline. Note that directly sampling from the standard normal distribution does not exploit the advantage of the latent DCT space effectively, since the information in the low-frequency bands is ignored. Moreover, the diversity increases sublinearly as DLow is applied to more low-frequency bands, indicating that a lower-frequency band carries more information than a higher-frequency band. This fits the energy compaction property of the DCT.

                   Human3.6M [25]                              HumanEva-I [46]
Method             Diversity ↑ ADE ↓ FDE ↓ MMADE ↓ MMFDE ↓     Diversity ↑ ADE ↓ FDE ↓ MMADE ↓ MMFDE ↓
Pose-Knows [48]    6.723       0.461 0.560 0.522   0.569       2.308       0.269 0.296 0.384   0.375
MT-VAE [53]        0.403       0.457 0.595 0.716   0.883       0.021       0.345 0.403 0.518   0.577
HP-GAN [7]         7.214       0.858 0.867 0.847   0.858       1.139       0.772 0.749 0.776   0.769
Best-of-Many [8]   6.265       0.448 0.533 0.514   0.544       2.846       0.271 0.279 0.373   0.351
GMVAE [14]         6.769       0.461 0.555 0.524   0.566       2.443       0.305 0.345 0.408   0.410
DeLiGAN [19]       6.509       0.483 0.534 0.520   0.545       2.177       0.306 0.322 0.385   0.371
DSF [54]           9.330       0.493 0.592 0.550   0.599       4.538       0.273 0.290 0.364   0.340
DLow [55]          11.730      0.425 0.518 0.495   0.532       4.849       0.246 0.265 0.360   0.340
VAE+DCT            3.462       0.429 0.545 0.525   0.581       0.966       0.249 0.296 0.412   0.445
VAE+DCT+5          12.579      0.412 0.514 0.497   0.538       4.181       0.234 0.244 0.369   0.347
VAE+DCT+20         15.920      0.416 0.522 0.502   0.546       6.266       0.239 0.253 0.371   0.345
Table 1: Comparison between our method and the state of the art. The symbol ↑ (or ↓) denotes that higher (or lower) is better. Best results are in boldface.

Marker-based motion representation. Here we measure the motion frequency w.r.t. FSE, in addition to the diversity and the prediction accuracy. The model ‘MOJO-DCT-proj’ has a latent dimension of 128. However, ‘MOJO-proj’ suffers from overfitting with the same latent dimension, and hence we set its latent dimension to 16. We use these dimensions in all following experiments. According to Tab. 1, we employ DLow on the 20% (i.e. the first 9) lowest frequency bands. The results are shown in Tab. 2. Similar to Tab. 1, the model with DCT has consistently better performance. Notably, higher motion frequency indicates that the generated motions contain more details, and hence are more realistic.

               ACCAD [2]                                             BMLhandball [21, 20]
Method         Diversity ↑ ADE ↓ FDE ↓ MMADE ↓ MMFDE ↓ FSE ↑         Diversity ↑ ADE ↓ FDE ↓ MMADE ↓ MMFDE ↓ FSE ↑
MOJO-DCT-proj  25.349      1.991 3.216 2.059   3.254   0.4           21.504      1.608 1.914 1.628   1.919   0.0
MOJO-proj      28.886      1.993 3.141 2.042   3.202   1.2           23.660      1.528 1.848 1.550   1.847   0.4
Table 2: Comparison between generative models for predicting marker-based motions. Best results of each model are in boldface. The FSE scores are in the scale of .

4.4 Evaluation on Motion Realism

4.4.1 Metrics

Rigid body deformation. We measure the markers on the head, the upper torso, and the lower torso, respectively. Specifically, following the CMU marker set [1], we use (‘LFHD’, ‘RFHD’, ‘RBHD’, ‘LBHD’) for the head, (‘RSHO’, ‘LSHO’, ‘CLAV’, ‘C7’) for the upper torso, and (‘RFWT’, ‘LFWT’, ‘LBWT’, ‘RBWT’) for the lower torso. The deformation is evaluated by the variation of the pairwise L2 distances between markers. For each rigid body part $P$, the deformation score is calculated by

$d_P = \mathbb{E}\left[ \frac{2}{|P|(|P|-1)} \sum_{i < j \in P} \operatorname{std}_t\!\left( \left\| x_i^t - x_j^t \right\| \right) \right],$

in which $x_i^t$ denotes the location of marker $i$ at time $t$, $\operatorname{std}_t$ denotes the standard deviation along the time dimension, and $\mathbb{E}$ denotes averaging the scores of the different predicted sequences.

Foot skating ratio. Foot skating is measured based on the two heel markers (‘LHEE’ and ‘RHEE’ in CMU [1]). We consider that foot skating occurs when both foot markers are sufficiently close to the ground and simultaneously exceed a speed limit. In our trials, the distance-to-ground threshold is 5 cm, and the speed threshold is 5 mm between two consecutive frames (i.e. 75 mm/s). We report the average ratio of frames with foot skating.
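This test can be sketched with the thresholds stated above (units in meters; toy heel trajectories, our own illustration): a frame transition counts as skating when both heel markers are within 5 cm of the ground and both move more than 5 mm between frames.

```python
import numpy as np

def foot_skating_ratio(lhee, rhee, height_thr=0.05, speed_thr=0.005):
    # lhee, rhee: (T, 3) heel-marker trajectories; z is height.
    near = (lhee[1:, 2] < height_thr) & (rhee[1:, 2] < height_thr)
    move_l = np.linalg.norm(lhee[1:] - lhee[:-1], axis=-1) > speed_thr
    move_r = np.linalg.norm(rhee[1:] - rhee[:-1], axis=-1) > speed_thr
    return float((near & move_l & move_r).mean())

T = 11
planted = np.zeros((T, 3))                               # feet never move
sliding = np.stack([np.linspace(0.0, 0.5, T),
                    np.zeros(T), np.zeros(T)], axis=1)   # 5 cm per frame on ground
```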

Perceptual score. We render the generated body meshes as well as the ground truth, and perform a perceptual study on Amazon Mechanical Turk. Subjects see a motion sequence and the statement “The human motion is natural and realistic.” They rate this on a six-point Likert scale from ‘strongly disagree’ (1) to ‘strongly agree’ (6). Each individual video is rated by three subjects. We report the mean value and standard deviation for each method and each dataset.

4.4.2 Results

In the following experiments, we randomly choose 60 different sequences from ACCAD and BMLhandball, respectively. Based on each sequence, we generate 15 future sequences.

                 ACCAD [2]                      BMLhandball [21, 20]
Method           Head    U. Torso  L. Torso     Head    U. Torso  L. Torso
MOJO-DCT-proj    76.3    102.2     99.4         86.0    105.3     83.3
MOJO-proj        70.3    80.7      76.7         68.3    77.7      63.2
MOJO-DCT         1.32    34.0      6.97         1.32    43.1      6.97
MOJO             1.30    32.7      6.76         1.40    44.3      7.64
Table 3: Evaluation on rigid part deformation in motion. All scores are in the scale of , and lower is better. Best results are in boldface.
Method ACCAD [2] BMLHandball [21, 20]
MOJO-DCT 0.341 0.077
MOJO 0.278 0.066
ground truth 0.067 0.002
Table 4: Comparison between methods w.r.t. foot skating. Best results are in boldface.
Method        ACCAD [2]     BMLhandball [21, 20]
MOJO-DCT      4.15 ± 1.38   4.00 ± 1.26
MOJO          4.07 ± 1.31   4.17 ± 1.23
ground truth  4.82 ± 1.08   4.83 ± 1.05
Table 5: Perceptual scores from the user study. Best results of each model are in boldface. Note that the score ranges from 1 to 6.
Figure 6: Visualization of 3D body motions. Bodies in gray-green and red denote the input and the generated motion, respectively. The solid and dash image borders denote the results from MOJO-DCT and MOJO, respectively.
                ACCAD [2]                                      BMLhandball [21, 20]
Method          Diversity  ADE    FDE    MMADE  MMFDE  BDF     Diversity  ADE    FDE    MMADE  MMFDE  BDF
MOJO w. J.      21.363     1.184  2.010  1.291  2.067  0.185   19.091     0.930  1.132  1.000  1.156  0.205
MOJO            20.676     1.214  1.919  1.306  2.080  0       16.806     0.949  1.139  1.001  1.172  0
Table 6: Comparison between marker-based and joint-based representations. Evaluations are based on the joint locations. BDF denotes the bone deformation in meters. The best results are in boldface.

Rigid body deformation. The results are shown in Tab. 3. With standard RNN inference, ‘MOJO-proj’ consistently outperforms ‘MOJO-DCT-proj’, which indicates that the latent DCT space captures interactions between markers more effectively. Once the recursive projection scheme is employed, the body shape is preserved. This verifies the effectiveness of the projection scheme, which constrains the motion prediction process to the space of valid bodies.

Foot skating. The results on foot skating are presented in Tab. 4. The model with the latent DCT space (MOJO) produces fewer foot skating artifacts. A probable reason is that the high-frequency components in the DCT space better model the foot movements.

Perceptual studies. The results in Tab. 5 show that MOJO performs slightly worse than MOJO-DCT on ACCAD, but outperforms it significantly on BMLhandball. A probable reason is that most actions in ACCAD are coarse-grained, whereas most actions in BMLhandball are fine-grained. The advantage of the DCT latent space in modelling finer-grained motion is more easily perceived in BMLhandball.

Motion visualization. Figure 6 shows some generated 3D body motions. Motions generated by MOJO contain finer-grained body movements.

4.5 Markers vs Joints

To investigate the benefits of our MOJO solution, we compare it with a traditional joint-based counterpart, MOJO w. J., which is trained with the SMPL [33] joint locations from CMU and MPI HDM05. The ground truth body mesh is then fitted to the predicted joints.

For performance evaluation, we re-calculate the diversity and prediction accuracy metrics w.r.t. the joint locations. In addition, we measure the bone deformation (BDF) of the eight limb bones by applying Eq. (3) to the joints. For the results from MOJO, we retrieve the corresponding SMPL joints from the SMPL-X body mesh for a fair comparison.

We randomly choose 60 sequences from each test set and generate 50 future motions based on each sequence. Results are presented in Tab. 6. The two representations perform comparably on these metrics. However, MOJO w. J. produces bone deformation, which can make it impossible to fit a character to the predicted joints. Figure 7 shows such an example: at the last predicted frame, the character cannot be fitted due to unrealistic bone lengths (notice the right calf bone).

Figure 7: Fitting a character to predicted joints. From left to right: The first predicted frame, the middle frame, and the last frame.

5 Conclusion

In this paper, we propose a new method, MOJO, to predict diverse plausible motions of 3D bodies. Instead of using joints to represent the body, MOJO uses a sparse set of markers on the body surface, which provides more constraints and hence benefits 3D body recovery. To better model interactions between markers, MOJO employs a new CVAE that represents motion with multiple latent frequencies, in contrast to most existing methods, which encode a motion into a single feature vector. To produce valid bodies in motion, MOJO uses a recursive projection scheme at test time. Compared to existing methods with new losses or discriminators, MOJO largely eliminates rigid body deformation and does not suffer from a domain gap.

Nevertheless, MOJO has limitations to address in future work. For example, the recursive projection scheme slows down inference, although it is still much faster than MoSh or MoSh++ [32, 34]. Also, the motion realism is not yet on par with the ground truth (see Tab. 4 and Tab. 5), indicating room for improvement. Moreover, we will explore the performance of MOJO with other marker layouts, or even real markers from mocap data.


We thank Nima Ghorbani for the advice on the body marker setting and the AMASS dataset. We thank Yinghao Huang, Cornelia Köhler, Victoria Fernández Abrevaya, and Qianli Ma for insightful comments and proofreading. We thank Xinchen Yan and Ye Yuan for discussions on baseline methods. We thank Shaofei Wang and Siwei Zhang for their help with the user study and the presentation, respectively.


MJB has received research gift funds from Adobe, Intel, Nvidia, Facebook, and Amazon. While MJB is a part-time employee of Amazon, his research was performed solely at, and funded solely by, Max Planck. MJB has financial interests in Amazon and Meshcapade GmbH.


  • [1] CMU Graphics Lab. CMU Graphics Lab Motion Capture Database. http://mocap.cs.cmu.edu/, 2000.
  • [2] ACCAD mocap system and data. https://accad.osu.edu/research/motion-lab/systemdata, 2018.
  • [3] Nasir Ahmed, T. Natarajan, and Kamisetty R Rao. Discrete cosine transform. IEEE Transactions on Computers, 100(1):90–93, 1974.
  • [4] Ijaz Akhter, Tomas Simon, Sohaib Khan, Iain Matthews, and Yaser Sheikh. Bilinear spatiotemporal basis models. ACM Transactions on Graphics (TOG), 31(2):1–12, 2012.
  • [5] Emre Aksan, Peng Cao, Manuel Kaufmann, and Otmar Hilliges. Attention, please: A spatio-temporal transformer for 3d human motion prediction. arXiv preprint arXiv:2004.08692, 2020.
  • [6] Emre Aksan, Manuel Kaufmann, and Otmar Hilliges. Structured prediction helps 3d human motion modelling. In Proceedings of the IEEE International Conference on Computer Vision, pages 7144–7153, 2019.
  • [7] Emad Barsoum, John Kender, and Zicheng Liu. HP-GAN: Probabilistic 3d human motion prediction via gan. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 1418–1427, 2018.
  • [8] Apratim Bhattacharyya, Bernt Schiele, and Mario Fritz. Accurate and diverse sampling of sequences based on a “best of many” sample objective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8485–8493, 2018.
  • [9] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Computer Vision – ECCV 2016, Lecture Notes in Computer Science. Springer International Publishing, Oct. 2016.
  • [10] Yujun Cai, Lin Huang, Yiwei Wang, Tat-Jen Cham, Jianfei Cai, Junsong Yuan, Jun Liu, Xu Yang, Yiheng Zhu, Xiaohui Shen, et al. Learning progressive joint propagation for human motion prediction. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • [11] Pierre Charbonnier, Laure Blanc-Feraud, Gilles Aubert, and Michel Barlaud. Two deterministic half-quadratic regularization algorithms for computed imaging. In Proceedings of the International Conference on Image Processing, pages 168–172, 1994.
  • [12] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, Oct. 2014. Association for Computational Linguistics.
  • [13] Qiongjie Cui, Huaijiang Sun, and Fei Yang. Learning dynamic relationships for 3d human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6519–6527, 2020.
  • [14] Nat Dilokthanakul, Pedro AM Mediano, Marta Garnelo, Matthew CH Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
  • [15] Partha Ghosh, Jie Song, Emre Aksan, and Otmar Hilliges. Learning human motion models for long-term predictions. In 2017 International Conference on 3D Vision (3DV), pages 458–466. IEEE, 2017.
  • [16] Anand Gopalakrishnan, Ankur Mali, Dan Kifer, Lee Giles, and Alexander G Ororbia. A neural temporal model for human motion prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12116–12125, 2019.
  • [17] Liang-Yan Gui, Yu-Xiong Wang, Xiaodan Liang, and José MF Moura. Adversarial geometry-aware human motion prediction. In Proceedings of the European Conference on Computer Vision (ECCV), pages 786–803, 2018.
  • [18] Liang-Yan Gui, Yu-Xiong Wang, Deva Ramanan, and José MF Moura. Few-shot human motion prediction via meta-learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 432–450, 2018.
  • [19] Swaminathan Gurumurthy, Ravi Kiran Sarvadevabhatla, and R Venkatesh Babu. Deligan: Generative adversarial networks for diverse and limited data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 166–174, 2017.
  • [20] Fabian Helm, Rouwen Cañal-Bruland, David L Mann, Nikolaus F Troje, and Jörn Munzert. Integrating situational probability and kinematic information when anticipating disguised movements. Psychology of Sport and Exercise, 46:101607, 2020.
  • [21] Fabian Helm, Nikolaus F Troje, and Jörn Munzert. Motion database of disguised and non-disguised team handball penalty throws by novice and expert performers. Data in brief, 15:981–986, 2017.
  • [22] Alejandro Hernandez, Jurgen Gall, and Francesc Moreno-Noguer. Human motion prediction via spatio-temporal inpainting. In Proceedings of the IEEE International Conference on Computer Vision, pages 7134–7143, 2019.
  • [23] Daniel Holden, Ikhsanul Habibie, Ikuo Kusajima, and Taku Komura. Fast neural style transfer for motion data. IEEE computer graphics and applications, 37(4):42–49, 2017.
  • [24] Yinghao Huang, Federica Bogo, Christoph Lassner, Angjoo Kanazawa, Peter V Gehler, Javier Romero, Ijaz Akhter, and Michael J Black. Towards accurate marker-less human shape and pose estimation over time. In 2017 international conference on 3D vision (3DV), pages 421–430. IEEE, 2017.
  • [25] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, jul 2014.
  • [26] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
  • [27] Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. VIBE: Video inference for human body pose and shape estimation. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 5252–5262. IEEE, June 2020.
  • [28] Chen Li, Zhen Zhang, Wee Sun Lee, and Gim Hee Lee. Convolutional sequence to sequence model for human dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5226–5234, 2018.
  • [29] Maosen Li, Siheng Chen, Yangheng Zhao, Ya Zhang, Yanfeng Wang, and Qi Tian. Dynamic multiscale graph neural networks for 3d skeleton based human motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 214–223, 2020.
  • [30] Hung Yu Ling, Fabio Zinno, George Cheng, and Michiel Van De Panne. Character controllers using motion vaes. ACM Transactions on Graphics (TOG), 39(4):40–1, 2020.
  • [31] Hongyi Liu and Lihui Wang. Human motion prediction for human-robot collaboration. Journal of Manufacturing Systems, 44:287–294, 2017.
  • [32] Matthew Loper, Naureen Mahmood, and Michael J Black. Mosh: Motion and shape capture from sparse markers. ACM Transactions on Graphics (TOG), 33(6):1–13, 2014.
  • [33] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.
  • [34] Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. In International Conference on Computer Vision (ICCV), pages 5442–5451, Oct. 2019.
  • [35] Wei Mao, Miaomiao Liu, Mathieu Salzmann, and Hongdong Li. Learning trajectory dependencies for human motion prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 9489–9497, 2019.
  • [36] Julieta Martinez, Michael J Black, and Javier Romero. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2891–2900, 2017.
  • [37] Meinard Müller, Tido Röder, Michael Clausen, Bernhard Eberhardt, Björn Krüger, and Andreas Weber. Documentation mocap database hdm05. 2007.
  • [38] Dirk Ormoneit, Hedvig Sidenbladh, Michael Black, and Trevor Hastie. Learning and tracking cyclic human motion. Advances in Neural Information Processing Systems, 13:894–900, 2000.
  • [39] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 10975–10985, June 2019.
  • [40] Dario Pavllo, Christoph Feichtenhofer, Michael Auli, and David Grangier. Modeling human motion with quaternion-based neural networks. International Journal of Computer Vision, pages 1–18, 2019.
  • [41] Dario Pavllo, David Grangier, and Michael Auli. Quaternet: A quaternion-based recurrent model for human motion. arXiv preprint arXiv:1805.06485, 2018.
  • [42] K. Pullen and C. Bregler. Animating by multi-level sampling. In Proceedings Computer Animation 2000, pages 36–42, 2000.
  • [43] K Ramamohan Rao and Ping Yip. Discrete cosine transform: algorithms, advantages, applications. Academic press, 2014.
  • [44] Javier Romero, Dimitrios Tzionas, and Michael J. Black. Embodied hands: Modeling and capturing hands and bodies together. ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 36(6), Nov. 2017.
  • [45] Andrey Rudenko, Luigi Palmieri, Michael Herman, Kris M Kitani, Dariu M Gavrila, and Kai O Arras. Human motion trajectory prediction: A survey. The International Journal of Robotics Research, 39(8):895–935, 2020.
  • [46] Leonid Sigal, Alexandru O Balan, and Michael J Black. Humaneva: Synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International journal of computer vision, 87(1-2):4, 2010.
  • [47] Yongyi Tang, Lin Ma, Wei Liu, and Weishi Zheng. Long-term human motion prediction by modeling motion context and enhancing motion dynamic. arXiv preprint arXiv:1805.02513, 2018.
  • [48] Jacob Walker, Kenneth Marino, Abhinav Gupta, and Martial Hebert. The pose knows: Video forecasting by generating pose futures. In Proceedings of the IEEE international conference on computer vision, pages 3332–3341, 2017.
  • [49] Mao Wei, Liu Miaomiao, and Salzemann Mathieu. History repeats itself: Human motion prediction via motion attention. In ECCV, 2020.
  • [50] Xinshuo Weng, Jianren Wang, Sergey Levine, Kris Kitani, and Nick Rhinehart. 4D Forecasting: Sequential Forecasting of 100,000 Points. ECCVW, 2020.
  • [51] Xinshuo Weng, Jianren Wang, Sergey Levine, Kris Kitani, and Nick Rhinehart. Inverting the Pose Forecasting Pipeline with SPF2: Sequential Pointcloud Forecasting for Sequential Pose Forecasting. CoRL, 2020.
  • [52] Erwin Wu and Hideki Koike. Futurepong: Real-time table tennis trajectory forecasting using pose prediction network. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–8, 2020.
  • [53] Xinchen Yan, Akash Rastogi, Ruben Villegas, Kalyan Sunkavalli, Eli Shechtman, Sunil Hadap, Ersin Yumer, and Honglak Lee. Mt-vae: Learning motion transformations to generate multimodal human dynamics. In Proceedings of the European Conference on Computer Vision (ECCV), pages 265–281, 2018.
  • [54] Ye Yuan and Kris Kitani. Diverse trajectory forecasting with determinantal point processes. arXiv preprint arXiv:1907.04967, 2019.
  • [55] Ye Yuan and Kris Kitani. Dlow: Diversifying latent flows for diverse human motion prediction. European Conference on Computer Vision (ECCV), 2020.
  • [56] M Ersin Yumer and Niloy J Mitra. Spectral style transfer for human motion between independent actions. ACM Transactions on Graphics (TOG), 35(4):1–8, 2016.
  • [57] Jason Y Zhang, Panna Felsen, Angjoo Kanazawa, and Jitendra Malik. Predicting 3d human dynamics from video. In Proceedings of the IEEE International Conference on Computer Vision, pages 7114–7123, 2019.
  • [58] Yan Zhang, Michael J Black, and Siyu Tang. Perpetual motion: Generating unbounded human motion. arXiv preprint arXiv:2007.13886, 2020.
  • [59] Yi Zhou, Connelly Barnes, Jingwan Lu, Jimei Yang, and Hao Li. On the continuity of rotation representations in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2019.
  • [60] Yi Zhou, Zimo Li, Shuangjiu Xiao, Chong He, Zeng Huang, and Hao Li. Auto-conditioned recurrent networks for extended complex human motion synthesis. In International Conference on Learning Representations, 2018.

Appendix A More Analysis on the Latent DCT Space

Analysis on the latent space dimension.

First, we use Human3.6M to analyze the latent DCT space, with joint locations representing the human body in motion. Tab. S1 shows the performance under various settings. As in Tab. 1, when DLow is applied to more frequency bands, the diversity consistently grows while the motion prediction accuracies remain stable. Notably, VAE+DCT with a 32d latent space outperforms the baseline [55] (see Tab. 1), indicating that our latent DCT space has better representation power.

Method Diversity ADE FDE MMADE MMFDE
(32d)VAE+DCT 3.405 0.432 0.544 0.533 0.589
(32d)VAE+DCT+1 7.085 0.419 0.514 0.515 0.547
(32d)VAE+DCT+5 12.007 0.415 0.510 0.505 0.542
(32d)VAE+DCT+10 13.103 0.417 0.513 0.507 0.544
(32d)VAE+DCT+20 14.642 0.418 0.516 0.510 0.548
(64d)VAE+DCT 3.463 0.429 0.544 0.532 0.587
(64d)VAE+DCT+1 7.254 0.417 0.514 0.513 0.547
(64d)VAE+DCT+5 12.554 0.413 0.510 0.504 0.540
(64d)VAE+DCT+10 14.233 0.414 0.514 0.506 0.546
(64d)VAE+DCT+20 15.462 0.416 0.517 0.508 0.548
Table S1: Model performance with different latent dimensions and numbers of frequency bands with DLow on the Human3.6M dataset. Best results are in boldface. This table is directly comparable with Tab. 1, which shows the results with the 128d latent space (same as in [55]).

Additionally, for the marker-based representation, we evaluate the influence of the latent feature dimension using the BMLhandball dataset. The results are presented in Tab. S2, in which DLow is applied for MOJO-DCT-proj. Following the investigation on the Human3.6M dataset, we apply DLow on the lowest 20% (the lowest 9) frequency bands in MOJO-proj, corresponding to VAE+DCT+20 in Tab. S1. We can see that MOJO-DCT-proj performs best with a 128d latent space, yet is worse than most variants of MOJO-proj, which again indicates the representation power of the latent DCT space. Meanwhile, the different versions of MOJO-proj perform comparably across latent dimensions. As the feature dimension increases, the diversity consistently increases, whereas the prediction accuracies decrease in most cases. Therefore, in our experiments, we set the latent dimensions of MOJO-DCT-proj and MOJO-proj to 128 and 16, respectively.

Method Diversity ADE FDE MMADE MMFDE FSE()
(8d)MOJO-DCT-proj 0.027 2.119 3.145 2.143 3.153 -2.6
(16d)MOJO-DCT-proj 0.060 2.105 3.134 2.125 3.133 -3.5
(32d)MOJO-DCT-proj 0.152 2.045 3.071 2.065 3.068 -3.9
(64d)MOJO-DCT-proj 17.405 1.767 2.213 1.790 2.219 0.2
(128d)MOJO-DCT-proj 21.504 1.608 1.914 1.628 1.919 0.0
(8d)MOJO-proj 20.236 1.525 1.893 1.552 1.893 -0.8
(16d)MOJO-proj 23.660 1.528 1.848 1.550 1.847 0.4
(32d)MOJO-proj 24.448 1.554 1.850 1.573 1.846 1.0
(64d)MOJO-proj 24.129 1.557 1.820 1.576 1.819 1.0
(128d)MOJO-proj 25.265 1.620 1.852 1.636 1.851 2.5
Table S2: Performances with various latent space dimensions on BMLhandball. Best results of each model are in boldface.
Visualization of the latent DCT space.

Since the latent space is in the frequency domain, we visualize the average frequency spectra of the inference posterior in Fig. S1, based on the VAE+DCT (128d) model and the Human3.6M dataset. We find that the bias of the fc layers between the DCT and the inverse DCT can lead to stair-like structures. Both with and without the fc layer bias, most information is concentrated in the low-frequency bands. This explains the performance saturation when employing DLow on more frequency bands, and matches the energy compaction property of the DCT. Moreover, we show the performance without the fc bias in Tab. S3. Compared to the results in Tab. 1, the influence of these bias values is negligible. Therefore, in our experiments, we keep the bias values in these fc layers trainable.
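The energy compaction property is easy to verify on a toy signal: for a smooth, motion-like trajectory, almost all DCT energy sits in the lowest bands. The following self-contained numpy sketch (unrelated to the released code) demonstrates this:

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II basis matrix (N x N), as in [3, 43]."""
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    C[0] /= np.sqrt(2.0)
    return C

# a smooth, motion-like 1D trajectory over 60 frames
t = np.linspace(0.0, 2.0 * np.pi, 60)
x = np.sin(t) + 0.3 * np.sin(2.0 * t)
c = dct_matrix(60) @ x                       # frequency coefficients
energy = np.cumsum(c ** 2) / np.sum(c ** 2)  # cumulative energy per band
# the lowest bands already hold nearly all of the energy, which is
# consistent with the saturation seen when DLow spans more bands
```

Because the basis is orthonormal, the cumulative energy reaches exactly 1 at the highest band, and for this smooth signal the lowest ~20% of the bands dominate.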

Figure S1: With the VAE+DCT (128d) model, we randomly select 5000 samples from the Human3.6M training set, and obtain their mean and logarithmic variance values from the VAE encoder. To show the frequency spectra, we average the absolute values of the means and of the logarithmic variances. Note that when both the mean and the logarithmic variance are zero, the posterior equals the standard normal distribution, which produces only white noise without information.

Method Diversity ADE FDE MMADE MMFDE
VAE+DCT-fcbias 3.442 0.431 0.547 0.525 0.584
VAE+DCT+1-fcbias 7.072 0.417 0.514 0.506 0.541
VAE+DCT+5-fcbias 13.051 0.413 0.512 0.498 0.537
VAE+DCT+10-fcbias 14.723 0.415 0.515 0.500 0.540
VAE+DCT+20-fcbias 16.008 0.415 0.517 0.501 0.542
Table S3: The model performances with zero values of the fc layer bias. ‘-fcbias’ denotes no bias. The best results are in boldface, and can be directly compared with the results in Tab. 1.

Appendix B More Analysis on the Robust Kullback–Leibler Divergence

The robust KLD term was first proposed by Zhang et al. [58] to encourage temporal correlations and overcome posterior collapse. Specifically, it assumes that the latent prior is not the standard normal distribution $\mathcal{N}(0, I)$, with $q(z \mid x)$ being the inference posterior. Instead, the KLD term is

$$ \mathcal{L}_{KLD} = \Psi\Big( \sum_t D_{KL}\big( q(z_t \mid x) \,\|\, \mathcal{N}(0, I) \big) \Big), $$

in which $\Psi(\cdot)$ is the Charbonnier penalty function, $\Psi(s) = \sqrt{s^2 + 1} - 1$ [11]. Without loss of generality, the feature dimension in $\mu$ and $\sigma$ is assumed to be 1D. According to Zhang et al. [58], the following holds:

Proposition 1.

The new KL-divergence in Eq. (4) can: (1) lead to a higher ELBO than its counterpart with a standard normal distribution prior, (2) encourage temporal (or frequency) correlations in the latent space, (3) avoid posterior collapse numerically, and (4) retain a low computational cost.


Due to the re-parameterization trick [26], we have $z_t = \mu_t + \sigma_t \epsilon_t$ with $\epsilon_t \sim \mathcal{N}(0, 1)$, in which $\mu_t$ and $\sigma_t$ are derived from the RNN states at each time step (or frequency band) $t$. Therefore, the KLD term can be written as

$$ \mathcal{L}_{KLD} = \Psi\Big( \sum_t d_t \Big) = \sqrt{1 + \Big(\sum_t d_t\Big)^2} - 1 = \sqrt{1 + \sum_t d_t^2 + \sum_{t \neq t'} d_t d_{t'}} - 1, $$

with $d_t = D_{KL}\big( q(z_t \mid x) \,\|\, \mathcal{N}(0, 1) \big)$ being the KL-divergence between the posterior and the standard normal distribution at $t$. The cross terms $d_t d_{t'}$ in this expansion couple different time steps (or frequency bands), which is how the correlation term appears in the formula. Also, since $\Psi(s) \leq s$ for $s \geq 0$, with the same generation model and reconstruction loss, this robust KLD term leads to a higher ELBO.

Since $\Psi$ is a scalar function, its computational cost remains low, in contrast to methods with an explicit latent prior, e.g. [Aksan et al., ICLR, 2019] and [Chung et al., NeurIPS, 2015]. In addition, note that the derivative of the Charbonnier function is $\Psi'(s) = s / \sqrt{s^2 + 1}$. Consequently, the gradients for updating the entire KLD term become small when the KLD is small. Numerically, this alleviates posterior collapse during training. ∎
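As a numerical illustration of the proposition, the robust KLD can be sketched as the Charbonnier penalty applied to the summed per-step Gaussian KL terms. This is a hedged numpy sketch; the function names and the reduction over the feature dimension are assumptions, not the released implementation:

```python
import numpy as np

def charbonnier(s):
    """Charbonnier penalty Psi(s) = sqrt(s^2 + 1) - 1 [11]."""
    return np.sqrt(s * s + 1.0) - 1.0

def robust_kld(mu, logvar):
    """Psi applied to the sum of per-step KL divergences.

    mu, logvar: (T, D) posterior parameters over T time steps or
    frequency bands. d_t is the closed-form KL between
    N(mu_t, sigma_t^2 I) and N(0, I). Because Psi wraps the SUM,
    expanding it couples the d_t of different steps (the correlation
    effect), and Psi(s) <= s means the penalty never exceeds the
    plain summed KLD.
    """
    d_t = 0.5 * (mu ** 2 + np.exp(logvar) - logvar - 1.0).sum(axis=-1)  # (T,)
    return float(charbonnier(d_t.sum()))
```

When the posterior equals the standard normal distribution, the term is exactly zero; for any other posterior it stays at or below the plain summed KLD, consistent with the higher-ELBO claim.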

To investigate its effectiveness in mitigating posterior collapse, we follow the benchmark proposed by [He et al., ICLR, 2019] and perform evaluations on image density estimation. Note that, with the robust KLD, we simply draw samples from the standard normal distribution during testing. The results are shown in Tab. S4. Our robust KLD method (with the loss weight equal to 1) outperforms most state-of-the-art methods. Although it performs worse than the lagging training scheme [He et al., ICLR, 2019], our robust KLD method is more efficient, since training and testing proceed as in a standard VAE. More analysis will be conducted in the future.

Method IW -ELBO KL Mutual Info. Active Units
previous reports
VLAE [Chen et al., 2017] 89.23 - - - -
VampPrior [Tomczak & Welling, 2018] 89.76 - - - -
Modified VAE objective
VAE + anneal 89.21(0.04) 89.55(0.04) 1.97(0.12) 1.79(0.11) 5.3(1.0)
β-VAE () 105.96(0.38) 113.24(0.40) 69.62(2.16) 3.89(0.03) 32.0(0.0)
β-VAE () 96.09(0.36) 101.16(0.66) 44.93(12.17) 3.91(0.03) 32.0(0.0)
β-VAE () 92.14(0.38) 94.92(0.47) 25.43(9.12) 3.93(0.03) 32.0(0.0)
β-VAE () 89.15(0.04) 90.17(0.06) 9.98(0.20) 3.84(0.03) 13.0(0.7)
SA-VAE + anneal [He et al., 2019] 89.07(0.06) 89.42(0.06) 3.32(0.08) 2.63(0.04) 8.6(0.5)
Lagging [He et al., 2019] 89.11(0.04) 89.62(0.16) 2.36(0.15) 2.02(0.12) 7.2(1.3)
Standard VAE objective
PixelCNN 89.73(0.04) - - - -
VAE 89.41(0.04) 89.67(0.06) 1.51(0.05) 1.43(0.07) 3.0(0.0)
SA-VAE 89.29(0.02) 89.54(0.03) 2.55(0.05) 2.20(0.03) 4.0(0.0)
Lagging 89.05(0.05) 89.52(0.03) 2.51(0.14) 2.19(0.08) 5.6(0.5)
Robust KLD (ours) 89.17(0.03) 89.63(0.03) 3.35(0.03) 2.74(0.07) 6.0(0.6)
Table S4: Comparison between state-of-the-art solutions to posterior collapse for image density estimation. Details of the evaluation metrics can be found in [He et al., ICLR, 2019].

Figure S2: The interface of the user study.

Appendix C More Experimental Details

Network architectures.

We have described the CVAE architecture of MOJO in Sec. 3, and compared it with several baselines and variants in Sec. 4. The architectures of the motion generators used are illustrated in Fig. S3. Compared to the CVAE of MOJO, VAE+DCT has no residual connections at the output, and the velocity reconstruction loss is replaced by a loss to minimize [55]. MOJO-DCT-proj encodes the motion into a single feature vector, rather than into a set of frequency components.

Figure S3: From top to bottom: (1) The CVAE architecture of MOJO. (2) The CVAE architecture of MOJO-DCT, which is used in Tables 2, 3, 4, and 5. (3) The architecture of VAE+DCT, which is evaluated in Tab. 1. Note that we only illustrate the motion generators here; the recursive projection scheme can be added at test time.

We use PyTorch v1.6.0 [Paszke et al., 2019] in our experiments. In the training loss Eq. (1), we set the weight empirically in all cases. We find that a larger value causes over-smoothing and loses motion details, whereas a smaller value can cause jitters. For the models with the latent DCT space, the robust KLD term with loss weight 1 is employed. For models without the latent DCT space, we use a standard KLD term and set its weight to 0.1. The weights of the KLD terms are not annealed during training. In the fitting loss Eq. (2), we also set the weights empirically. Smaller values can degrade pose realism with e.g. twisted torsos, whereas larger values reduce pose variations in motion. Our publicly released code provides more details.

User study interface.

Our user study is performed via Amazon Mechanical Turk (AMT). The interface is illustrated in Fig. S2. We use a six-point Likert scale for evaluation, and each video is evaluated by three subjects.