Modelling of human motion over time has a number of applications in activity recognition, human-computer interaction, human detection and tracking, and image-based pose estimation in the context of robotics or self-driving vehicles. Humans have the ability to forecast a sequence of poses over short-term horizons with high accuracy and can imagine probable motion over arbitrary time scales. Despite recent progress in data-driven modelling of human motion [7, 8, 14, 20, 25, 33], this task remains difficult for machines.
The difficulty of the task is manifold. First, human motion is highly dynamic and non-linear, and over time becomes a stochastic sequential process with a high degree of inherent uncertainty. Humans leverage strong structural and temporal priors about continuity and regularity in natural motion. However, these are hard to model algorithmically due to i) the inter-dependencies between joints and ii) the influence of high-level activities on the motion sequences (e.g., the transition from walking to jumping). In fact, many recent approaches forgo explicit modelling of human motion in favor of purely data-driven models [8, 20, 25].
However, it has been observed that good quantitative performance does not always translate to qualitatively meaningful predictions [20, 25]. Furthermore, the H3.6M benchmark is becoming saturated, limiting progress. This leads to the two main research questions studied in this work: i) How can we measure the accuracy of pose predictions in a meaningful way, such that low errors correspond to good qualitative results, and how can we improve this performance? ii) How can we exploit the spatial structure of the human skeleton for better predictions?
With respect to i) we note that much of the literature relies on the H3.6M dataset and a Euler-angle-based metric as performance measure, evaluated on a limited number of test sequences. While enabling initial exploration of the task, the dataset is limited in size (roughly 3 hours from 210 sequences) and in the diversity of activities and poses, which contributes to a saturation effect in terms of performance. In this paper we show that existing techniques do not scale well when trained on larger and more diverse datasets. To this end, we leverage the recently released AMASS dataset, itself consisting of multiple smaller motion datasets, offering many more samples (14x over H3.6M) and a wider range of activities. To further unpack the performance of motion modelling techniques, we introduce several evaluation metrics to the task of human motion prediction.
Our main technical contribution is a novel structured prediction layer (SPL) that addresses our second research question. We leverage the compositional structure of the human skeleton by explicitly decomposing the pose into individual joints. The SP-layer models the structure of the human skeleton and hence the spatial dependencies between joints. This is achieved via a hierarchy of small-sized neural networks that are connected analogously to the kinematic chains of the human skeleton. Each node in the graph receives information about the parent node’s prediction and thus information is propagated along the kinematic chain. We furthermore introduce a joint-wise decomposition of the loss function as part of SPL. The proposed layer is agnostic to the underlying network and can be used in combination with most previously proposed architectures.
We show experimentally that introducing this layer to existing approaches improves the performance of the respective method. The impact is most pronounced on the larger and more challenging AMASS dataset. This indicates that our approach is indeed a step towards successfully exploiting spatial priors in human motion modelling and in turn allows recurrent models to capture temporal coherency more effectively. We thoroughly evaluate the SP-layer on H3.6M and AMASS. On AMASS, for any base model, any metric, and any input representation, it is beneficial to use the SP-layer. Furthermore, even simple architectures that are otherwise outperformed by a zero-velocity baseline perform competitively if paired with the SP-layer.
In summary, we contribute: i) An in-depth analysis of state-of-the-art motion modelling methods and their evaluation. ii) A new benchmark and evaluation protocol on the recent, much larger AMASS dataset. iii) A novel prediction layer, incorporating structural priors. iv) A thorough evaluation of the SP-layer’s impact on motion modelling in combination with several base models.
2 Related Work
We briefly review the most closely related literature on human motion modelling, focusing on deep learning-based approaches for brevity.
Deep recurrent models
Early work makes use of specialized Deep Belief Networks for motion modelling, whereas more recent works leverage recurrent architectures. For example, Fragkiadaki et al. propose the Encoder-Recurrent-Decoder (ERD) framework, which maps pose data into a latent space where it is propagated through time via an LSTM cell. The prediction at time step $t$ is fed back as the input for time step $t+1$. This scheme quickly leads to error accumulation and hence catastrophic drift over time. To increase robustness, Gaussian noise is added during training. While alleviating the drift problem, this training scheme is hard to fine-tune. Quantitative and qualitative evaluations are performed on the publicly available H3.6M dataset, with a joint angle data representation using the exponential map (also called angle-axis). The joint-wise Euclidean distance on the Euler angles is used as the evaluation metric. Most of the follow-up work adheres to this setting.
Inspired by this line of work, Du et al. have recently proposed to combine a three-layer LSTM with bio-mechanical constraints encoded into the loss function for pedestrian pose and gait prediction. Like them, we also incorporate prior knowledge into our network design, but do so through a particular design of the output layer rather than enforcing physical constraints in the loss function. Similar in spirit, Ghosh et al. stabilize forecasting for long-term prediction horizons via application of dropouts on the input layer of a denoising autoencoder. In this work we focus on short-term predictions, but also apply dropouts directly on the inputs to account for noisy predictions of the model at test time. In contrast to their approach, our model can be trained end-to-end.
Martinez et al. propose a sequence-to-sequence model that is trained auto-regressively, using its own predictions during training. A residual connection on the decoder leads to smoother and improved short-term predictions. Martinez et al. also show that simple running-average baselines are surprisingly difficult to beat in terms of the Euler angle metric. The currently best performance on H3.6M is reported by Wang et al., who also use a sequence-to-sequence approach trained with an adversarial loss to address the drift problem and to create smooth predictions. Highlighting some of the issues with the previously used loss, they further propose a more meaningful geodesic loss.
In this work we show that sequence-to-sequence models, despite good performance on H3.6M, do not fare as well on the larger, more diverse AMASS dataset. Although augmenting them with our SP-layer boosts their performance, they are outperformed by a simple RNN that uses the same SP-layer. To better characterize motion modelling performance we furthermore introduce several new evaluation metrics.
Jain et al. propose to explicitly model structural information by automatically converting a spatio-temporal graph (st-graph) into an RNN (S-RNN). The skeleton is divided into 5 major clusters, whose interactions are then manually encoded into an st-graph. Our model is also structure-aware. However, our approach does not require a coarse subdivision of joints and does not require manual definition of st-graphs. Moreover, our layer is agnostic to the underlying network and can be interfaced with most existing architectures.
Bütepage et al. propose to encode poses with a hierarchy of dense layers following the kinematic chain starting from the end-effectors (dubbed H-TE), which is similar to our SP-layer. In contrast to this work, H-TE operates on the input rather than the output, and has only been demonstrated with non-recurrent networks using 3D positions to parameterize the poses.
Structure-aware network architectures have also been used in 3D pose estimation from images [16, 29, 21, 17, 31]. Some of these works learn a structured latent space, while others exploit structure only implicitly by encoding the poses into distance matrices which then serve as inputs and outputs of the network. Closest to our work are approaches that explicitly modify the network to account for skeletal structure, either via the loss function or via a sequence of LSTM cells for each joint in the skeleton. The latter introduces many new layers into the architecture and needs hyper-parameter tuning to be most effective. In contrast, our proposed SP-layer is simple to implement and train. We show that it improves the performance of several baseline architectures out-of-the-box.
Most work parameterizes joint angles as exponential maps relative to each joint's parent. Pavllo et al. show results competitive with the state of the art using quaternions. Their model, QuaterNet, consists of 2 layers of GRU cells and, similar to prior work, uses a skip connection. The use of quaternions allows for the integration of a differentiable forward kinematics layer, facilitating loss computation in the form of the Euclidean distance of 3D joint positions. For short-term predictions, QuaterNet directly optimizes for the previously introduced Euler-angle-based metric. We show that QuaterNet also benefits from augmentation with our SP-layer, indicating that SPL is independent of the underlying joint angle representation.
Bütepage et al. [2, 3] and Holden et al. convert the data directly to 3D joint positions. These works do not use recurrent structures, which necessitates the extraction of fixed-size, temporal windows for training. They focus on learning latent representations, which are shown to be helpful for various tasks such as denoising, forecasting, or motion generation along a given trajectory. Follow-up work extends this by applying a conditional variational autoencoder (VAE) to the task of online motion prediction in human-robot interactions. We use the positional representation of human poses to compute an informative metric of the prediction quality. However, for learning we use joint angles since they encode symmetries better and are inherently bone-length invariant.
3 Method
The goal of our work is to provide a general solution to the problem of human motion modelling. To this end we are motivated by the observation that human motion is strongly regulated by the spatial structure of the skeleton. However, integrating this structure into deep neural network architectures has so far not yielded better performance than architectures that only model temporal dependencies explicitly. In this section we outline a novel structured prediction layer (SPL) that explicitly captures the spatial connectivity. The layer is designed to be agnostic to the underlying network. We empirically show in Sec. 5 and 6 that it improves the performance of a variety of existing models irrespective of the dataset or the data representation used.
3.1 Problem Formulation
A motion sample can be considered as a sequence $X = (x_1, \dots, x_T)$, where a frame $x_t$ at time-step $t$ denotes the $N$-dimensional body pose. $N$ depends on the number of joints in the skeleton, $K$, and the size $M$ of the per-joint representation (angle-axis, rotation matrices, quaternions, or 3D positions), i.e. $N = K \cdot M$.
Due to their temporal nature, motion sequences are often modelled with auto-regressive approaches. Such models factorize the joint probability of a motion sequence as a product of conditionals as follows:

$$p_\theta(x_{1:T}) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1}), \quad (1)$$

where the joint distribution is parameterized by $\theta$. At each time step $t$, the next pose $x_t$ is predicted given the past poses $x_{1:t-1}$.
While this auto-regressive setting explicitly models the temporal dependencies, the spatial structure is treated only implicitly. In other words, given a pose vector, the model must predict the whole pose vector at the next time step. This assumes that joints are independent from each other given a particular context (i.e., a neural representation of the past frames). However, the human body is composed of hierarchical joints and the kinematic chain introduces spatial dependencies between them.
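The closed-loop, auto-regressive setup described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation; the toy pose dimensionality and the `zero_velocity` predictor (the naive baseline mentioned later in the text) are our own choices.

```python
import numpy as np

def rollout(predict_next, seed_poses, horizon):
    """Auto-regressive sampling: the prediction for time step t is fed
    back as input when predicting t+1, mirroring the factorization of
    the sequence probability into per-step conditionals."""
    history = [np.asarray(p, dtype=float) for p in seed_poses]
    predictions = []
    for _ in range(horizon):
        x_next = predict_next(history)   # models p(x_t | x_1..t-1)
        predictions.append(x_next)
        history.append(x_next)           # closed loop: consume own output
    return predictions

def zero_velocity(history):
    """The surprisingly strong naive baseline: repeat the last seen pose."""
    return history[-1].copy()
```

Because the model consumes its own outputs, any prediction error is re-ingested at the next step, which is the source of the drift problem discussed above.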
3.2 Structured Prediction Layer
To address this shortcoming, we propose a novel structured prediction layer (SPL). This is formed by decomposing the model prediction into individual joints. This decomposition is guided by the spatial prior of the human kinematic chain, depicted in Fig. 1. Formally, the pose $x_t$ is a concatenation of $K$ joints $x_t^{(k)}$:

$$x_t = \big[x_t^{(1)};\ x_t^{(2)};\ \dots;\ x_t^{(K)}\big].$$
To interface with existing architectures, the SP-layer takes a context representation $h_t$ as input. Here, $h_t$ is assumed to summarize the motion sequence until time $t$. Without loss of generality, we assume this to be a hidden RNN state or its projection. While existing work typically leverages several dense layers to predict the $N$-dimensional pose vector from $h_t$, our SP-layer predicts each joint individually with separate smaller networks:

$$\hat{x}_t^{(k)} = f^{(k)}\big(h_t,\ \hat{x}_t^{(\mathrm{parent}(k))}\big),$$

where $\mathrm{parent}(k)$ extracts the parent of the $k$-th joint. Importantly, the full body pose is predicted by following the skeletal hierarchy in Fig. 1 as follows:

$$\hat{x}_t^{(\mathrm{root})} = f^{(\mathrm{root})}(h_t), \qquad \hat{x}_t^{(k)} = f^{(k)}\big(h_t,\ \hat{x}_t^{(\mathrm{parent}(k))}\big) \ \text{for all non-root joints } k.$$
In this formulation each joint receives information about its own configuration and that of the immediate parent both explicitly, through the conditioning on the parent joint's prediction, and implicitly via the context $h_t$. The joint probability of Eq. 1 is further factorized in the spatial domain:

$$p(x_t \mid x_{1:t-1}) = \prod_{k=1}^{K} p\big(x_t^{(k)} \mid x_t^{(\mathrm{parent}(k))},\ x_{1:t-1}\big).$$
The benefit of this structured prediction approach is two-fold. First, the proposed factorization allows for integration of a structural prior in the form of a hierarchical architecture where each joint is modelled by a different network. This allows the model to learn dedicated representations per joint and hence saves model capacity. Second, analogous to message passing, each parent propagates its prediction to the child joints, allowing for more precise local predictions because the joint has access to the information it depends on (i.e., the parent’s prediction).
In our experiments (cf. Sec. 5 and 6) we show that this layer improves the prediction performance of a diverse set of underlying architectures across many settings and metrics. One potential reason for why this is the case can be found in the resulting network structure and its implications on network training. Fig. 2 compares our structured approach with the traditional one-shot prediction using a dense layer. Because the per-joint decomposition leads to many small separate networks, we can think of an SP-layer as a dense layer where some connections have been set to zero explicitly by leveraging domain knowledge. This decomposition changes the gradients w.r.t. the units in the hidden layer, which are now only affected by the gradients coming from the joint hierarchy that they model. In the traditional setting, the error computed as an average over all joints can easily be distributed over all network weights in an arbitrary fashion.
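The per-joint prediction scheme can be sketched numerically as below. This is a minimal illustration of the idea, not the paper's code: the 4-joint toy chain, the dimensions, and the random weights are our assumptions; the real SP-layer follows the full skeleton of Fig. 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy kinematic chain: joint id -> parent id (root has -1).
PARENTS = {0: -1, 1: 0, 2: 1, 3: 1}
JOINT_DIM = 9        # e.g. a flattened 3x3 rotation matrix per joint
CONTEXT_DIM = 16     # size of the RNN context h_t (assumption)
HIDDEN = 64          # one small hidden layer per joint, as in the text

def make_joint_net(in_dim):
    """One small per-joint network: hidden ReLU layer + linear projection."""
    return {"W1": rng.normal(0.0, 0.1, (in_dim, HIDDEN)),
            "W2": rng.normal(0.0, 0.1, (HIDDEN, JOINT_DIM))}

def joint_forward(net, inp):
    return np.maximum(inp @ net["W1"], 0.0) @ net["W2"]

# The root conditions on the context only; every other joint additionally
# receives its parent's prediction, so its input is larger by JOINT_DIM.
NETS = {j: make_joint_net(CONTEXT_DIM + (0 if p == -1 else JOINT_DIM))
        for j, p in PARENTS.items()}

def spl_forward(context):
    """Predict all joints following the hierarchy: parents before children."""
    preds = {}
    for j in sorted(PARENTS):            # parent ids are smaller in this toy chain
        p = PARENTS[j]
        inp = context if p == -1 else np.concatenate([context, preds[p]])
        preds[j] = joint_forward(NETS[j], inp)
    return np.concatenate([preds[j] for j in sorted(PARENTS)])
```

Note how information propagates along the chain: joint 3's input contains joint 1's prediction, which in turn was computed from joint 0's, realizing the message-passing interpretation given above.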
3.3 Per-joint Loss
We additionally propose to perform a similar decomposition in the objective function, which leads to further improvements. The training objective is often a metric in Euclidean space between ground-truth poses $x_t$ and predictions $\hat{x}_t$:

$$\mathcal{L} = \sum_{t=1}^{T} \ell(x_t, \hat{x}_t),$$

where $\ell$ is a loss function such as an $L_2$ norm. The loss is calculated on the entire pose vector and averaged across the temporal and spatial domain. In our work, we use a slightly modified version that preserves joint integrity:

$$\mathcal{L}_{\mathrm{SPL}} = \sum_{t=1}^{T} \sum_{k=1}^{K} \ell\big(x_t^{(k)}, \hat{x}_t^{(k)}\big), \quad (6)$$
where the loss is first calculated on every joint and then summed up to form the loss for the entire motion sequence. In this work we use the MSE for $\ell$, but the formulation allows for an easy adaptation of domain-specific losses such as the previously proposed geodesic distance.
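The per-joint decomposition of the loss described above can be sketched as follows; the array shapes are an assumption for illustration.

```python
import numpy as np

def per_joint_mse(target, pred, joint_dim):
    """Per-joint loss: compute the MSE for every joint separately, then
    sum over joints and time steps, instead of averaging over the whole
    flattened pose vector at once."""
    T, N = target.shape
    assert N % joint_dim == 0, "pose dim must be a multiple of the joint dim"
    K = N // joint_dim
    tgt = target.reshape(T, K, joint_dim)
    prd = pred.reshape(T, K, joint_dim)
    per_joint = ((tgt - prd) ** 2).mean(axis=-1)   # MSE inside each joint
    return float(per_joint.sum())                  # sum over joints and time
```

Compared to one global mean, summing per-joint errors prevents a few easy joints from washing out large errors on individual joints.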
4 Human Motion Modelling
We now evaluate our SP-layer on the task of human motion modelling. We perform our experiments on two datasets and three different underlying architectures which use three different data representations. In the following we explain the datasets and models in more detail.
4.1 Datasets
Given the small size of H3.6M and the reported variance of test results, we propose to use the recently introduced AMASS dataset for the motion modelling task. We obtained the data from an earlier release, as the official AMASS dataset had not yet been published at the time of this writing. AMASS is composed of publicly available databases, such as the CMU Mocap database and HumanEva, and uses the SMPL model to represent motion sequences. The sequences are sampled at 60 Hz and in total make AMASS about 14 times bigger than H3.6M (which is sampled at 50 Hz).
We split the AMASS dataset into training, validation and test splits. Similar to the H3.6M protocol, the input sequences are a few seconds long and the target sequences a few hundred milliseconds long. The H3.6M benchmarks use a total of 120 test samples across the action categories. This is a relatively small test set and it has been reported to cause high variance. In our H3.6M experiments we use this setting to ensure a fair comparison. However, on AMASS we use every frame in the test split by shifting a 2-second window over the motion sequences to extract the test samples. H3.6M and AMASS model the human skeleton with different numbers of major joints; we implement separate SP-layers corresponding to the underlying skeleton.
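The sliding-window extraction of test samples can be sketched as below. The window length and stride are assumptions for illustration; at AMASS's 60 Hz frame rate, a 2-second window corresponds to 120 frames.

```python
def extract_windows(seq_len, window_len, stride=1):
    """All test windows obtained by shifting a fixed-length window over a
    motion sequence; returns (start, end) index pairs with end exclusive."""
    return [(s, s + window_len)
            for s in range(0, seq_len - window_len + 1, stride)]
```

With stride 1 every frame of the test split serves as the start of a window (where a full window still fits), which is how the larger AMASS test set is obtained.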
4.2 Models
The modular nature of our SP-layer allows for flexible deployment with a diverse set of base models. In our experiments, we test the layer with the following three representative architectures proposed in the literature. To ease experimentation with SPL and other base architectures, we make all code and pre-trained models available at https://ait.ethz.ch/projects/2019/spl.
Seq2seq is the model proposed by Martinez et al., consisting of a single layer of GRU cells. It contains a residual connection between the inputs and predictions. Input poses are represented as exponential maps.
QuaterNet uses a quaternion representation instead [24, 25]. The model augments RNNs with quaternion based normalization and regularization operations. Similarly, the residual connection from inputs to outputs is implemented via the quaternion product. In our experiments, we replace the final linear output layer with our SP-layer and keep the remaining setup intact.
RNN uses a single layer recurrent network to calculate the context $h_t$, which we feed to our SP-layer. In contrast to the Seq2seq and QuaterNet settings, we represent poses via rotation matrices. To account for the error accumulation problem at test time [7, 8, 14], we apply dropout directly on the inputs. This architecture is similar to the ERD but is additionally augmented with a residual connection.
In the SP-layer, each joint is modelled with only one small hidden layer (64 or 128 units) followed by a ReLU activation and a linear projection to the joint prediction. We experiment with different hierarchical configurations in SPL (cf. Sec. 6.3), where following the true kinematic chain performed best. Some models benefit from receiving all parent joints in the kinematic chain as input, rather than only the immediate parent. Note that we changed the existing Seq2seq and QuaterNet models only as much as required to integrate them with SPL. To ensure a fair comparison we fine-tune hyper-parameters like the learning rate, batch size and hidden layer units. See appendix Sec. 8.1 for details.
5 Evaluation on H3.6M Dataset
In our first set of comparisons we evaluate the proposed SP-layer on the H3.6M dataset using the Euler angle metric, as is common practice in the literature.
Let $R$ denote a rotation by angle $\theta$ around the unit axis $\omega$; the vector $\theta\omega \in \mathbb{R}^3$ is the angle-axis (or exponential map) representation of a single joint angle. The Euler angles are extracted from $\theta\omega$ by first converting it into a rotation matrix using Rodrigues' formula and then computing the angles following Slabaugh's method. This assumes that $R$ follows the z-y-x order. Furthermore, as noted by Slabaugh, there always exist two solutions, from which the one that leads to the least amount of rotation is picked. The Euler angle metric for time step $t$ is then

$$\mathrm{euler}(t) = \sum_{\tilde{t}=1}^{t} \sqrt{\sum_{k=1}^{K} \big\| e_{\tilde{t}}^{(k)} - \hat{e}_{\tilde{t}}^{(k)} \big\|_2^2},$$

where $\hat{e}_t^{(k)}$ are the predicted Euler angles of joint $k$ at time $t$. The test set is defined by Martinez et al. and comprises 120 sequences.
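The conversion chain behind this metric can be sketched as follows. This is an illustrative sketch under our own conventions; the degenerate (gimbal-lock) case $|R_{31}| = 1$ of the Euler extraction is omitted for brevity, and the accumulation over the horizon follows the summation suggested later in the text.

```python
import numpy as np

def aa_to_rotmat(aa):
    """Rodrigues' formula: angle-axis vector (theta * unit axis) -> 3x3 rotation."""
    theta = np.linalg.norm(aa)
    if theta < 1e-12:
        return np.eye(3)
    k = aa / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def rotmat_to_euler_zyx(R):
    """Euler angles (x, y, z) for the z-y-x order. Of the two valid solutions
    we return the one with y in [-pi/2, pi/2], i.e. the least rotation."""
    y = -np.arcsin(R[2, 0])
    cy = np.cos(y)
    x = np.arctan2(R[2, 1] / cy, R[2, 2] / cy)
    z = np.arctan2(R[1, 0] / cy, R[0, 0] / cy)
    return np.array([x, y, z])

def euler_metric(pred_aa, tgt_aa):
    """pred_aa, tgt_aa: (T, K, 3) angle-axis poses. Per time step, the
    Euclidean distance over all joints' Euler angles, accumulated over
    the prediction horizon."""
    total = 0.0
    for t in range(pred_aa.shape[0]):
        sq = 0.0
        for k in range(pred_aa.shape[1]):
            e_p = rotmat_to_euler_zyx(aa_to_rotmat(pred_aa[t, k]))
            e_t = rotmat_to_euler_zyx(aa_to_rotmat(tgt_aa[t, k]))
            sq += float(np.sum((e_p - e_t) ** 2))
        total += np.sqrt(sq)
    return total
```

The many convention choices visible here (rotation order, solution picking, degenerate cases) are exactly what makes this metric error-prone in practice, as discussed in Sec. 6.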
Tab. 1 summarizes the relative performances of models with and without the SP-layer on the H3.6M dataset and compares them to the state of the art. The publicly available Seq2seq  and QuaterNet  models are augmented with our SP-layer, but we otherwise follow the original training and evaluation protocols of the respective baseline model.
Using the SP-layer improves the Seq2seq performance significantly and achieves state-of-the-art performance in the walking category. Similarly, SPL yields the best performance with QuaterNet for short-term smoking and discussion motions, and marginally outperforms or is comparable to the vanilla QuaterNet in most categories. While our SP-layer also boosts the performance of the RNN model in the walking, eating and smoking motion categories, performance remains similar for discussion.
6 AMASS: A New Benchmark
In this section we evaluate the baseline methods and our SP-layer on the large-scale AMASS dataset, detailed in Sec. 4.1. The diversity and large amount of motion samples in AMASS increase both the task's complexity and the reliability of results due to a larger test set. In addition to proposing a new evaluation setting for motion modelling, we suggest the use of a more versatile set of metrics for the task.
6.1 Evaluation Metrics
So far, motion prediction has been benchmarked on H3.6M using the Euclidean distance between target and predicted Euler angles [14, 20, 25, 33]. Numbers are usually reported per action at certain time steps, averaged over 8 samples. Unfortunately, Euler angles have twelve different conventions (not counting the fact that each of these can be defined using intrinsic or extrinsic rotations), which makes the practical implementation of this metric error-prone.
For a more precise analysis we introduce additional metrics from related pose estimation areas [28, 32, 34]. In order to increase robustness we furthermore suggest to i) sum the metric up to time step $t$ rather than report it only at time step $t$, ii) use more test samples covering a larger portion of the test dataset and iii) evaluate the models with complementary metrics. Note that we do not train the models on these metrics; they only serve as evaluation criteria at test time.
Joint angle difference
To circumvent the potential source of error in the Euler angle metric, we propose using another angle-based metric following [11, 32]. This metric computes the angle of the rotation required to align the predicted joint with the target joint. Unlike the Euler angle metric, it is independent of how rotations are parameterized. It is furthermore similar to the previously proposed geodesic loss. Let $\hat{R}$ be the predicted joint angle for a given joint, parameterized as a rotation matrix, and $R$ the respective target rotation. The difference in rotation can be computed as $R_{\mathrm{diff}} = \hat{R} R^{T}$, from which we construct the metric at time step $t$ as follows:

$$\mathrm{angle}(t) = \sum_{\tilde{t}=1}^{t} \sum_{k=1}^{K} \Big\| \log\big( \hat{R}_{\tilde{t}}^{(k)} (R_{\tilde{t}}^{(k)})^{T} \big) \Big\|_2,$$

where $R_t^{(k)}$ is the rotation matrix of joint $k$ at time $t$. In contrast to the Euler angle metric, we compute this metric on global joint angles by first unrolling the kinematic chain.
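A sketch of this metric is given below. We use the identity that the norm of the rotation log equals the rotation angle, recoverable from the trace; the array shapes are our assumption, and global (chain-unrolled) rotations are expected as input.

```python
import numpy as np

def rotation_angle(R):
    """Geodesic angle of a rotation matrix: ||log(R)|| = arccos((tr(R) - 1) / 2).
    The clip guards against numerical noise pushing the argument out of [-1, 1]."""
    return float(np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)))

def joint_angle_metric(pred_R, tgt_R):
    """pred_R, tgt_R: (T, K, 3, 3) GLOBAL rotation matrices (the kinematic
    chain is assumed to be unrolled already). Sums the per-joint angle of
    R_diff = R_pred @ R_target^T over joints and the prediction horizon."""
    total = 0.0
    for t in range(pred_R.shape[0]):
        for k in range(pred_R.shape[1]):
            diff = pred_R[t, k] @ tgt_R[t, k].T
            total += rotation_angle(diff)
    return total
```

Note the appeal of this formulation: it needs no Euler convention at all, only matrix products and a trace.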
Positional
Following Pavllo et al.'s suggestion, we introduce a positional metric. This metric simply performs forward kinematics on $x_t$ and $\hat{x}_t$ to obtain the 3D joint positions $p_t^{(k)}$ and $\hat{p}_t^{(k)}$, respectively, and then computes the Euclidean distance per joint:

$$\mathrm{positional}(t) = \sum_{\tilde{t}=1}^{t} \sum_{k=1}^{K} \big\| p_{\tilde{t}}^{(k)} - \hat{p}_{\tilde{t}}^{(k)} \big\|_2.$$

We normalize the skeleton bones such that the right thigh bone has unit length.
PCK
In cases where large errors occur, the value of the positional metric can be misleading. Hence, following the 3D (hand) pose estimation literature [13, 22, 28, 34], we introduce PCK by computing the percentage of predicted joints lying within a spherical threshold $\rho$ around the target joint position, i.e.

$$\mathrm{pck}(t) = \frac{1}{t} \sum_{\tilde{t}=1}^{t} \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\big[ \| p_{\tilde{t}}^{(k)} - \hat{p}_{\tilde{t}}^{(k)} \|_2 \le \rho \big],$$

where $\mathbb{1}[\cdot]$ returns $1$ if its input is true, and $0$ otherwise. Note that for PCK we do not sum, but average, until time step $t$.
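PCK and its area under the curve (the AUC reported in Tab. 2) can be sketched as below; the nested-list input format and the threshold grid are our assumptions.

```python
def pck(pred_pos, tgt_pos, rho):
    """pred_pos, tgt_pos: nested lists [time][joint] of 3D points.
    Fraction of predicted joints within distance rho of the target,
    averaged (not summed) over joints and time steps."""
    hits, total = 0, 0
    for frame_p, frame_t in zip(pred_pos, tgt_pos):
        for p, q in zip(frame_p, frame_t):
            dist = sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            hits += dist <= rho
            total += 1
    return hits / total

def pck_auc(pred_pos, tgt_pos, thresholds):
    """Area under the PCK curve: mean PCK over a grid of thresholds rho."""
    vals = [pck(pred_pos, tgt_pos, r) for r in thresholds]
    return sum(vals) / len(vals)
```

Unlike the summed positional error, PCK saturates once a joint is "close enough", so a single badly predicted joint cannot dominate the score.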
6.2 Results
Tab. 2 summarizes the performance of the three model variants, each with and without the SP-layer. We trained the base models with minimal modifications, i.e. design, training objective and regularizations are kept intact. We use angle-axis, quaternion and rotation matrix representations for the Seq2seq, QuaterNet, and RNN models, respectively. To make a fair comparison, we run a hyper-parameter search over the batch size, cell type, learning rate and hidden layer size.
Unlike on H3.6M, LSTM cells consistently outperform GRUs on AMASS for the Seq2seq and RNN models. In contrast to the original setup, we also train the Seq2seq model by applying dropout on the inputs, similar to our RNN architecture. QuaterNet gives its best performance with GRU cells, though some fine-tuning of the teacher-forcing ratio is necessary.
In all settings, the Seq2seq models fail to give competitive performance on this large-scale task and are sometimes outperformed by the zero-velocity baseline proposed by Martinez et al. QuaterNet shows strong performance and is in fact the closest vanilla model to the SPL variants. However, our SP-layer still improves the QuaterNet results further. The contribution of the SP-layer is best observable on the RNN model. With the help of a larger dataset, the proposed RNN-SPL achieves the best results under different metrics and prediction horizons. Fig. 3 compares two baseline methods with their corresponding SPL extensions for a fixed prediction horizon and different choices of the threshold $\rho$. The RNN-SPL consistently outperforms the other methods. More results are shown in the appendix Sec. 8.3.
Please also note the complementary effect of the proposed metrics in Tab. 2. At one prediction horizon the Seq2seq-dropout-SPL model shows a significant improvement w.r.t. the Euler angle metric, and in fact achieves the best result across all models. However, this is no longer the case when we look at the proposed metrics, where the model performs marginally worse than the best performing model, RNN-SPL. The joints closer to the root of the kinematic chain have a much larger impact on the overall pose, since wrong rotations propagate to all the child joints on the chain. This effect might be ignored when only local rotations are considered, as is the case for the Euler angle metric. The joint angle and positional metrics account for this by first unrolling the kinematic chain.
6.3 Ablation Study
To study SPL in more depth we conduct an ablation study, presented in Tab. 3. We observe that the main performance boost is achieved by the decomposition of the output layer and the per-joint loss in Eq. (6). While the per-joint loss alone (i.e., without SPL) is not beneficial on H3.6M, on AMASS its application alone already helps (RNN-PJL). It is also effective on Seq2seq models with noisy inputs, but the performance degrades on the vanilla Seq2seq model. In longer-term predictions, the SP-layer shows a significant contribution (see Tab. 2). Assuming independent joints without modelling any hierarchy (RNN-SPL-indep.) improves the results further. Introducing a hierarchy into the prediction layer in reverse or random order often performs similarly or better. However, introducing the spatial dependencies according to the kinematic chain (RNN-SPL) yields the best results, with the exception of the positional metric.
7 Conclusion
We introduce prior knowledge about the human skeletal structure into a neural network by means of a structured prediction layer (SPL). The SP-layer explicitly decomposes the pose into individual joints and can be interfaced with a variety of baseline architectures. We furthermore introduce AMASS, a large-scale motion dataset, and several metrics to the task of motion prediction. On AMASS, we empirically show that for any baseline model, any metric, and any input representation, it is better to use the proposed SP-layer. The simple RNN model augmented with the SP-layer achieved state-of-the-art performance on the new AMASS benchmark.
Acknowledgments
We thank the reviewers for their insightful comments and Martin Blapp for fruitful discussions. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme, grant agreement No. 717054. We thank the NVIDIA Corporation for the donation of GPUs used in this work.
References
- (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- (2017) Deep representation learning for human motion prediction and classification. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1591–1599.
- (2018) Anticipating many futures: online human motion prediction and generation for human-robot interaction. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–9.
- (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734.
- (2008) Guide to the Carnegie Mellon University multimodal activity (CMU-MMAC) database. Robotics Institute, pp. 135.
- (2019) Bio-LSTM: a biomechanically inspired recurrent neural network for 3D pedestrian pose and gait prediction. IEEE Robotics and Automation Letters (RA-L).
- (2015) Recurrent network models for human dynamics. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4346–4354.
- (2017) Learning human motion models for long-term predictions. In 2017 International Conference on 3D Vision (3DV), pp. 458–466.
- (2015) Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 Technical Briefs, pp. 18:1–18:4.
- (2016) A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics 35(4), pp. 138:1–138:11.
- (2018) Deep inertial poser: learning to reconstruct human pose from sparse inertial measurements in real time. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 37, pp. 185:1–185:15.
- (2014) Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(7), pp. 1325–1339.
- (2018) Hand pose estimation via latent 2.5D heatmap regression. In ECCV 2018, Lecture Notes in Computer Science, Vol. 11215, pp. 125–143.
- (2016) Structural-RNN: deep learning on spatio-temporal graphs. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5308–5317.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2018) Propagating LSTM: 3D pose estimation based on joint interdependency. In Computer Vision – ECCV 2018, Part VII, pp. 123–141.
- (2015) Maximum-margin structured learning with deep networks for 3D human pose estimation. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2848–2856.
- (2015) SMPL: a skinned multi-person linear model. ACM Transactions on Graphics (TOG) 34(6), pp. 248.
- (2019) AMASS: archive of motion capture as surface shapes. In The IEEE International Conference on Computer Vision (ICCV).
- (2017) On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017.
- (2017) 3D human pose estimation from a single image via distance matrix regression. In CVPR, pp. 1561–1570.
- (2018) GANerated hands for real-time 3D hand tracking from monocular RGB. In Proceedings of Computer Vision and Pattern Recognition (CVPR).
- (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop (NIPS-W).
- (2019) Modeling human motion with quaternion-based neural networks. CoRR abs/1901.07677.
- (2018) QuaterNet: a quaternion-based recurrent model for human motion. In British Machine Vision Conference (BMVC) 2018, pp. 299.
- (2010) HumanEva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision (IJCV) 87(1), pp. 4–27.
- Computing Euler angles from a rotation matrix. http://www.gregslabaugh.net/publications/euler.pdf, last accessed 21.03.2019.
- (2018) Cross-modal deep variational hand pose estimation. In CVPR.
- (2017) Compositional human pose regression. In ICCV, pp. 2621–2630.
- (2011) Two distributed-state models for generating high-dimensional time series. Journal of Machine Learning Research 12, pp. 1025–1068.
- (2016) Structured prediction of 3D human pose with deep neural networks. In Proceedings of the British Machine Vision Conference (BMVC) 2016.
-  (2017) Sparse inertial poser: automatic 3d human pose estimation from sparse imus. Computer Graphics Forum 36(2), Proceedings of the 38th Annual Conference of the European Association for Computer Graphics (Eurographics), pp. 349–360. Cited by: §6.1, §6.1.
-  (2018-10) Adversarial geometry-aware human motion prediction. In European Conference on Computer Vision (ECCV), Cited by: §1, §1, §2, §3.3, Table 1, §6.1, §6.1, §6.2.
-  (2017) Learning to estimate 3d hand pose from single rgb images. In IEEE International Conference on Computer Vision (ICCV), Note: https://arxiv.org/abs/1705.01389 External Links: Cited by: §6.1, §6.1.
8.1 Architecture Details
The RNN and Seq2seq models are implemented in TensorFlow. For the QuaterNet-SPL model we extend the publicly available PyTorch source code. Our aim is to make a minimal number of modifications to the baseline Seq2seq and QuaterNet models. In order to get the best performance on the new AMASS dataset, we fine-tune the hyper-parameters, including batch size, learning rate, learning-rate decay, cell type and number of cell units, dropout rate, hidden output layer size and, for QuaterNet, the teacher-forcing ratio decay.
Fig. 5 provides an overview of these models. The SP-layer replaces the standard dense output layers, which normally map the context representation (i.e., the GRU or LSTM state up to the current time-step) directly to the pose vector prediction. Instead, the SPL component follows the kinematic chain and predicts every joint with a small per-joint network that receives the context representation together with the prediction of the joint's parent. The per-joint output size is 3, 4, or 9 for the exponential map, quaternion, or rotation matrix pose representation, respectively, and the per-joint hidden layer size is a tuned hyper-parameter (see Tab. 4). Similar to the H3.6M setup [14, 20], we use fixed-length seed sequences and short-term target sequences; the target portion corresponds to the predictions the model is evaluated on.
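The per-joint prediction scheme can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy kinematic chain, layer sizes, and random stand-in weights are all assumptions.

```python
import numpy as np

# Toy kinematic chain: joint index -> parent index (-1 = root).
# Illustrative only; the actual model uses the full skeleton hierarchy.
PARENTS = [-1, 0, 1, 2]
JOINT_SIZE = 4            # e.g. one quaternion per joint
H_SIZE, HIDDEN = 16, 8    # context / per-joint hidden sizes (assumed)

rng = np.random.default_rng(0)
# One small two-layer network per joint; random stand-ins for learned weights.
weights = [(0.1 * rng.standard_normal((HIDDEN, H_SIZE + JOINT_SIZE)),
            0.1 * rng.standard_normal((JOINT_SIZE, HIDDEN)))
           for _ in PARENTS]

def sp_layer(h):
    """Predict joints in kinematic-chain order, feeding each joint the
    context vector h concatenated with its parent's prediction."""
    preds = [None] * len(PARENTS)
    for j, p in enumerate(PARENTS):
        parent = preds[p] if p >= 0 else np.zeros(JOINT_SIZE)
        w1, w2 = weights[j]
        preds[j] = w2 @ np.maximum(w1 @ np.concatenate([h, parent]), 0.0)
    return np.concatenate(preds)   # full pose vector

pose = sp_layer(np.ones(H_SIZE))
```

Because every joint sees its parent's prediction, errors and dependencies propagate along the skeleton rather than being modelled by one flat dense layer.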
We train the vanilla models by using the training objectives proposed in the original papers. The SPL variants, however, implement these objectives with our proposed joint-wise loss. After each epoch of training we evaluate the model on the validation split and apply early stopping with respect to the joint angle metric. Please note that the early-stopping metric differs from the training objective for all models.
(Table: results in terms of the Euler, Joint Angle, Positional, and PCK (AUC) metrics.)
We use the rotation matrix pose representation with zero-mean, unit-variance normalization and teacher-forcing training. In other words, the model is trained by feeding the ground-truth pose at every step to predict the next one. The training objective is the proposed joint-wise loss (see Sec. 3.3 in the paper), calculated over the entire seed and target sequence.
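A minimal sketch of such a joint-wise loss follows; taking the error norm per joint, summing over joints and averaging over the remaining dimensions is our reading of the reduction, and the joint count and representation size are illustrative.

```python
import numpy as np

N_JOINTS, JOINT_SIZE = 15, 9   # e.g. rotation-matrix representation (assumed)

def joint_wise_loss(pred, target):
    """Norm of the error taken per joint and summed over joints,
    instead of a single norm over the flattened pose vector."""
    diff = (pred - target).reshape(-1, N_JOINTS, JOINT_SIZE)
    return float(np.linalg.norm(diff, axis=-1).sum(axis=-1).mean())
```

In contrast to a flat norm over the whole pose vector, this weights every joint's error individually, so small joints are not dominated by large ones.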
We do not follow a sampling-based training scheme. In the absence of such a training regularization, the model overfits to the likelihood (i.e., the ground-truth input samples) and hence performs poorly in the auto-regressive test setup. We find that a small amount of dropout on the inputs makes the model robust against this exposure bias problem.
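An illustrative stand-in for such input dropout is shown below; the inverted-dropout scaling is our assumption, and the paper's actual dropout rate is not reproduced here.

```python
import numpy as np

def input_dropout(x, rate, rng):
    """Randomly zero input features during training, with inverted-dropout
    scaling so the expected value of the input is preserved."""
    if rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return np.where(mask, x / (1.0 - rate), 0.0)
```

Dropping input features forces the recurrent model to tolerate imperfect poses, which mimics the noisy, self-generated inputs it sees at test time.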
The dropout is followed by a linear layer and a single LSTM cell. The vanilla RNN model then predicts the next pose auto-regressively from the LSTM state via a residual (velocity) connection. We also experimented with GRU units instead of LSTM cells, but found that LSTMs consistently outperformed GRUs. Finally, we use the Adam optimizer with its default parameters. The learning rate is decayed exponentially after a fixed number of training steps.
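The decay schedule corresponds to staircase exponential decay; the base learning rate, decay rate, and step counts used in the test below are placeholders, not the values from our experiments.

```python
def exp_decayed_lr(base_lr, decay_rate, decay_steps, global_step):
    """Staircase exponential decay, as in TensorFlow's
    tf.train.exponential_decay with staircase=True (our assumption)."""
    return base_lr * decay_rate ** (global_step // decay_steps)
```

The integer division keeps the learning rate constant within each interval of `decay_steps` steps and multiplies it by `decay_rate` at every interval boundary.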
As proposed by Martinez et al., we use the exponential map pose representation with zero-mean, unit-variance normalization. The model consists of encoder and decoder components with shared parameters. The seed sequence is first fed to the encoder network to calculate the hidden cell state, which the decoder then uses to initialize the prediction into the future. The training objective is calculated between the ground-truth targets and the predictions, for which we use the proposed joint-wise loss.
In our AMASS experiments, we find that a single LSTM cell performs better than a single GRU cell. In the training of the Seq2seq-sampling model, the decoder predictions are fed back as inputs to the model. The other two variants, Seq2seq-dropout and Seq2seq (see Tab. 2 in the paper), are trained with ground-truth inputs similar to the RNN models. Likewise, the vanilla Seq2seq model uses an additional hidden output layer on the AMASS dataset.
We use the Adam optimizer with its default parameters. The learning rate is decayed exponentially after a fixed number of training steps.
We use the quaternion pose representation without any further normalization of the data. The data is pre-processed following Pavllo et al.'s suggestions to avoid mixing antipodal representations within a given sequence. QuaterNet also follows the sequence-to-sequence architecture, where the seed sequence is used to initialize the cell states. As in the vanilla model, the training objective is based on the Euler angle pose representation; more specifically, the predictions in quaternion representation are converted to Euler angles to calculate the training objective.
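The antipodal-representation fix can be sketched as follows: since q and −q encode the same rotation, each quaternion is sign-flipped whenever it disagrees with its predecessor. This is a sketch of the idea, not the reference implementation.

```python
import numpy as np

def enforce_sign_continuity(seq):
    """Flip the sign of q_t whenever its dot product with q_{t-1} is
    negative, so a sequence never mixes antipodal representations."""
    seq = np.array(seq, dtype=float)
    for t in range(1, len(seq)):
        if np.dot(seq[t], seq[t - 1]) < 0.0:
            seq[t] = -seq[t]
    return seq
```

Without this step, a training sequence can jump between q and −q from frame to frame, which looks like a large discontinuity to the network even though the underlying rotation is smooth.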
The model consists of two stacked GRU cells. In contrast to the RNN and Seq2seq models, the residual velocity is implemented by using quaternion multiplication. Moreover, the QuaterNet model applies a normalization penalty and explicitly normalizes the predictions in order to enforce valid rotations. As proposed by Pavllo et al., we exponentially decay the teacher-forcing ratio, which determines the probability of using ground-truth poses during training. Over time this value approaches zero and hence increases the probability of using the model's own predictions rather than the ground-truth poses. Similar to the vanilla RNN and Seq2seq models, an additional hidden output layer performed better on the AMASS dataset.
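A residual step via quaternion multiplication might look like the following sketch; the (w, x, y, z) convention and the order of multiplication are our assumptions.

```python
import numpy as np

def qmul(q, r):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def residual_step(q_prev, delta):
    """Compose the predicted rotational 'velocity' with the previous pose
    and re-normalize, enforcing a valid unit quaternion."""
    q = qmul(delta, q_prev)
    return q / np.linalg.norm(q)
```

Composing rotations multiplicatively (rather than adding angle vectors) keeps the residual interpretation consistent with the group structure of rotations, and the explicit normalization mirrors the model's validity constraint.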
Finally, the model is trained using the Adam optimizer with its default parameters. The learning rate is decayed exponentially after every training epoch.
8.2 Long-term Prediction on AMASS
In Tab. 5, we report longer-term prediction results as an extension to the results provided in Tab. 2 in the main paper. Please note that all models are trained on the short-term prediction horizon only. In fact, the Seq2seq and QuaterNet models have been proposed to solve short-term prediction tasks only.
Consistent with the short-term prediction results shown in the main paper, our proposed SP-layer always improves the performance of the underlying model. While QuaterNet-SPL is competitive, RNN-SPL yields the best performance across the different metrics.
In Fig. 6 we show more qualitative results for QuaterNet and Seq2seq when augmented with our SP-layer. Please refer to the supplemental video for more qualitative results.
8.3 PCK Plots
We provide additional PCK plots for several prediction horizons in Fig. 7. Please note that shorter time horizons do not use the entire range of thresholds, in order to avoid a saturation effect.
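The PCK metric and its area under the curve can be sketched as follows; approximating the area by the mean PCK over the threshold grid is our assumption.

```python
import numpy as np

def pck(pred, target, threshold):
    """Fraction of joints whose Euclidean positional error falls
    below the given threshold."""
    dists = np.linalg.norm(pred - target, axis=-1)  # (..., n_joints)
    return float((dists <= threshold).mean())

def pck_auc(pred, target, thresholds):
    """Area under the PCK curve, approximated as the mean PCK
    over a grid of thresholds."""
    return float(np.mean([pck(pred, target, t) for t in thresholds]))
```

Summarizing the PCK curve with a single AUC number makes models comparable without fixing one arbitrary threshold.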
8.4 Ablation Study
(Table: ablation results in terms of the Euler, Joint Angle, Positional, and PCK (AUC) metrics.)